
Fast Algorithms
for Analyzing Massive Data
Alexander Gray
Georgia Institute of Technology
www.fast-lab.org
The FASTlab
Fundamental Algorithmic and Statistical Tools Laboratory
www.fast-lab.org
1. Alexander Gray: Assoc Prof, Applied Math + CS; PhD CS
2. Arkadas Ozakin: Research Scientist, Math + Physics; PhD Physics
3. Dongryeol Lee: PhD student, CS + Math
4. Ryan Riegel: PhD student, CS + Math
5. Sooraj Bhat: PhD student, CS
6. Nishant Mehta: PhD student, CS
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. Hua Ouyang: PhD student, CS
10. Ravi Sastry: PhD student, CS
11. Long Tran: PhD student, CS
12. Ryan Curtin: PhD student, EE
13. Ailar Javadi: PhD student, EE
14. Anita Zakrzewska: PhD student, CS
+ 5-10 MS students and undergraduates
7 tasks of machine learning / data mining
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3) (a naive O(N^2) KDE sketch follows this list)
3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM
4. Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3); Gaussian graphical models, discrete graphical models
6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)
7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding
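To make the quadratic cost concrete: kernel density estimation is O(N^2) because every query point sums a kernel over every reference point. A minimal NumPy sketch of that naive baseline (the Gaussian kernel, bandwidth h, and data sizes here are arbitrary choices for illustration; this is the computation the fast algorithms later in the talk avoid):

    import numpy as np

    def kde_naive(queries, references, h):
        # Naive Gaussian KDE: O(N_queries * N_references) kernel evaluations.
        d = references.shape[1]
        norm = (2 * np.pi * h**2) ** (d / 2) * len(references)
        densities = np.empty(len(queries))
        for i, q in enumerate(queries):                            # loop over queries
            sq_dists = ((references - q) ** 2).sum(axis=1)         # loop over references (vectorized)
            densities[i] = np.exp(-sq_dists / (2 * h**2)).sum() / norm
        return densities

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    print(kde_naive(X[:5], X, h=0.5))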
7 tasks of machine learning / data mining (expanded)
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3), submanifold density estimation [Ozakin & Gray, NIPS 2010] O(N^3), convex adaptive kernel estimation [Sastry & Gray, AISTATS 2011] O(N^4)
3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM, non-negative SVM [Guan et al., 2011]
4. Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3); Gaussian graphical models, discrete graphical models, rank-preserving maps [Ouyang and Gray, ICML 2008] O(N^3), isometric separation maps [Vasiloglou, Gray, and Anderson, MLSP 2009] O(N^3), isometric NMF [Vasiloglou, Gray, and Anderson, MLSP 2009] O(N^3), functional ICA [Mehta and Gray, 2009], density preserving maps [Ozakin and Gray, in prep] O(N^3)
6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)
7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding
7 tasks of machine learning / data mining
Computational problem!
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM
4. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3), LASSO
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3), Gaussian graphical models, discrete graphical models
6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)
7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding
The “7 Giants” of Data (computational problem types)
[Gray, Indyk, Mahoney, Szalay, in National Acad of Sci Report on Analysis of Massive Data, in prep]
1. Basic statistics: means, covariances, etc.
2. Generalized N-body problems: distances, geometry
3. Graph-theoretic problems: discrete graphs
4. Linear-algebraic problems: matrix operations
5. Optimizations: unconstrained, convex
6. Integrations: general dimension
7. Alignment problems: dynamic prog, matching
7 general strategies
1. Divide and conquer / indexing (trees)
2. Function transforms (series)
3. Sampling (Monte Carlo, active learning)
4. Locality (caching)
5. Streaming (online)
6. Parallelism (clusters, GPUs)
7. Problem transformation (reformulations)
1. Divide and conquer
• Fastest approach for:
  – nearest neighbor, range search (exact) ~O(log N) [Bentley 1970], all-nearest-neighbors (exact) O(N) [Gray & Moore, NIPS 2000], [Ram, Lee, March, Gray, NIPS 2010], anytime nearest neighbor (exact) [Ram & Gray, SDM 2012], max inner product [Ram & Gray, under review] (a minimal kd-tree search sketch follows this slide)
  – mixture of Gaussians [Moore, NIPS 1999], k-means [Pelleg and Moore, KDD 1999], mean-shift clustering O(N) [Lee & Gray, AISTATS 2009], hierarchical clustering (single linkage, friends-of-friends) O(N log N) [March & Gray, KDD 2010]
  – nearest neighbor classification [Liu, Moore, Gray, NIPS 2004], kernel discriminant analysis O(N) [Riegel & Gray, SDM 2008]
  – n-point correlation functions ~O(N^(log n)) [Gray & Moore, NIPS 2000], [Moore et al., Mining the Sky 2000], multi-matcher jackknifed npcf [March & Gray, under review]
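A minimal single-tree illustration of the divide-and-conquer idea behind these results: a kd-tree nearest-neighbor search that descends into the child containing the query first and prunes the other child whenever it cannot contain anything closer than the current best. This is a textbook sketch (leaf size, splitting rule, and data are arbitrary), not the lab's dual-tree/multi-tree algorithms:

    import numpy as np

    class Node:
        def __init__(self, idx=None, dim=None, split=None, left=None, right=None):
            self.idx, self.dim, self.split = idx, dim, split
            self.left, self.right = left, right

    def build(points, idx, depth=0, leaf_size=10):
        if len(idx) <= leaf_size:
            return Node(idx=idx)                          # leaf: just store point indices
        dim = depth % points.shape[1]                     # cycle through split dimensions
        order = idx[np.argsort(points[idx, dim])]
        mid = len(order) // 2
        return Node(dim=dim, split=points[order[mid], dim],
                    left=build(points, order[:mid], depth + 1, leaf_size),
                    right=build(points, order[mid:], depth + 1, leaf_size))

    def nearest(node, points, q, best=(np.inf, -1)):
        if node.idx is not None:                          # leaf: scan its few points
            d = np.sqrt(((points[node.idx] - q) ** 2).sum(axis=1))
            j = d.argmin()
            return (d[j], node.idx[j]) if d[j] < best[0] else best
        near, far = ((node.left, node.right) if q[node.dim] <= node.split
                     else (node.right, node.left))
        best = nearest(near, points, q, best)             # search the query's side first
        if abs(q[node.dim] - node.split) < best[0]:       # prune if the far side cannot beat the best
            best = nearest(far, points, q, best)
        return best

    rng = np.random.default_rng(1)
    X = rng.uniform(size=(10000, 2))
    root = build(X, np.arange(len(X)))
    print(nearest(root, X, np.array([0.5, 0.5])))         # (distance, index)

On typical low-dimensional data most of the tree is pruned, which is where the ~O(log N) per-query behavior comes from.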
3-point correlation (biggest previous: 20K)
VIRGO simulation data, N = 75,000,000
naive: 5 x 10^9 sec. (~150 years); multi-tree: 55 sec. (exact)
Scaling with n: n=2: O(N); n=3: O(N^(log 3)); n=4: O(N^2)
3-point correlation
(galaxy simulation data, 10^6 points; naive runtimes estimated)

                               Naive O(N^n)     Single bandwidth        Speedup        Multi-bandwidth (new)          Speedup
                               (estimated)      [Gray & Moore 2000,                    [March & Gray, in prep 2010]
                                                Moore et al. 2000]
2-point cor. (100 matchers)    2.0 x 10^7 s     352.8 s                 56,000         4.96 s                         71.1
3-point cor. (243 matchers)    1.1 x 10^11 s    891.6 s                 1.23 x 10^8    13.58 s                        65.6
4-point cor. (216 matchers)    2.3 x 10^14 s    14530 s                 1.58 x 10^10   503.6 s                        28.8
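For reference, the computation being accelerated in this table: an n-point correlation estimate reduces to counting n-tuples of points whose pairwise separations satisfy a "matcher" (a distance bin). A naive single-matcher 2-point counter, i.e. the O(N^2) baseline column above (the bin edges and data are arbitrary illustrations):

    import numpy as np

    def two_point_count(points, r_lo, r_hi):
        # Count pairs whose separation falls in [r_lo, r_hi): the naive O(N^2) kernel
        # of a 2-point correlation estimate (3- and 4-point are O(N^3) and O(N^4)).
        count = 0
        for i in range(len(points)):
            d = np.sqrt(((points[i + 1:] - points[i]) ** 2).sum(axis=1))   # all pairs (i, j > i)
            count += int(((d >= r_lo) & (d < r_hi)).sum())
        return count

    rng = np.random.default_rng(2)
    X = rng.uniform(size=(2000, 3))
    print(two_point_count(X, 0.1, 0.2))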
2. Function transforms
• Fastest approach for:
  – Kernel estimation (low-ish dimension): dual-tree fast Gauss transforms (multipole/Hermite expansions) [Lee, Gray, Moore, NIPS 2005], [Lee and Gray, UAI 2006]
  – KDE and GP (kernel density estimation, Gaussian process regression) (high-D): random Fourier functions [Lee and Gray, in prep] (a generic random-feature sketch follows this slide)
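For the high-dimensional case, the random-feature idea works roughly as follows: map each point through a small set of random cosine features so that inner products of the mapped points approximate the Gaussian kernel, turning kernel sums into ordinary linear algebra. A generic Rahimi-Recht-style sketch (feature count, bandwidth, and data are arbitrary; this is not necessarily the specific construction in the cited work):

    import numpy as np

    def random_fourier_features(X, n_features, bandwidth, seed=0):
        # z(x) . z(y) approximates exp(-||x - y||^2 / (2 * bandwidth^2)).
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=1.0 / bandwidth, size=(X.shape[1], n_features))  # spectral samples
        b = rng.uniform(0, 2 * np.pi, size=n_features)                        # random phases
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 5))
    Z = random_fourier_features(X, n_features=2000, bandwidth=1.5)
    approx = Z[0] @ Z[1]
    exact = np.exp(-((X[0] - X[1]) ** 2).sum() / (2 * 1.5 ** 2))
    print(approx, exact)   # the two values should be close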
3. Sampling
• Fastest approach for (approximate):
  – PCA: cosine trees [Holmes, Gray, Isbell, NIPS 2008]
  – Kernel estimation: bandwidth learning [Holmes, Gray, Isbell, NIPS 2006], [Holmes, Gray, Isbell, UAI 2007], Monte Carlo multipole method (with SVD trees) [Lee & Gray, NIPS 2009] (a plain Monte Carlo sketch follows this slide)
  – Nearest-neighbor: distance-approximate: spill trees with random projections [Liu, Moore, Gray, Yang, NIPS 2004]; rank-approximate: [Ram, Ouyang, Gray, NIPS 2009]
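The simplest form of the sampling strategy for kernel summations: estimate the sum over all N reference points from a random subset and rescale, with a standard error that tells you when to stop. A plain Monte Carlo sketch (sample size, bandwidth, and data are arbitrary; the cited Monte Carlo multipole method combines this idea with trees and SVDs):

    import numpy as np

    def mc_kernel_sum(q, references, h, n_samples, seed=0):
        # Unbiased estimate of sum_i exp(-||q - x_i||^2 / (2 h^2)) from a random subset.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(references), size=n_samples, replace=True)
        k = np.exp(-((references[idx] - q) ** 2).sum(axis=1) / (2 * h ** 2))
        estimate = len(references) * k.mean()
        stderr = len(references) * k.std(ddof=1) / np.sqrt(n_samples)
        return estimate, stderr

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100000, 4))
    est, err = mc_kernel_sum(X[0], X, h=1.0, n_samples=2000)
    exact = np.exp(-((X - X[0]) ** 2).sum(axis=1) / 2.0).sum()
    print(est, "+/-", err, "exact:", exact)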
Rank-approximate NN:
• Best meaning-retaining approximation criterion in the face of high-dimensional distances
• More accurate than LSH (a uniform-sampling illustration of the rank idea follows this slide)
[Figure: speedup over linear search (10^0 to 10^4, log scale) on the bio, corel, covtype, images, mnist, phy, and urand datasets, for rank-approximation levels e = 0% (exact), 0.001%, 0.01%, 0.1%, 1%, 10%, with a = 0.95.]
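The intuition behind the rank-approximation criterion can be seen with plain uniform sampling: keep the best of k uniformly sampled points, and its expected rank among all N distances is about N / (k + 1), so the sample size directly controls rank error rather than distance error. This is only an illustration of the criterion (the cited NIPS 2009 algorithm uses trees and stratified sampling to obtain guarantees far more cheaply):

    import numpy as np

    def rank_approx_nn(q, points, k, seed=0):
        # Closest of k uniformly sampled points; expected rank is roughly N / (k + 1).
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(points), size=k, replace=False)
        d = np.sqrt(((points[idx] - q) ** 2).sum(axis=1))
        j = d.argmin()
        return idx[j], d[j]

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100000, 20))
    q = X[0] + 0.1                                   # a query near an existing point
    i, d = rank_approx_nn(q, X, k=500)
    true_rank = int((np.sqrt(((X - q) ** 2).sum(axis=1)) < d).sum())
    print("returned index", i, "distance", d, "true rank", true_rank)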
3. Sampling
• Active learning: the sampling can depend on previous samples
  – Linear classifiers: rigorous framework for pool-based active learning [Sastry and Gray, AISTATS 2012] (a bare-bones uncertainty-sampling sketch follows this slide)
• Empirically allows reduction in the number of objects that require labeling
• Theoretical rigor: unbiasedness
[Figure: comparison of UPAL, BMAL, VW, RAL, and PL; x-axis 50 to 300, y-axis 0 to 30.]
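A bare-bones version of the pool-based active learning setting for a linear classifier: train on the points labeled so far, then ask the oracle to label the pool point the current model is least certain about. This generic uncertainty-sampling sketch (data, step sizes, and budget are made up) only illustrates the setting; the cited work instead uses importance weighting to keep the resulting estimates unbiased:

    import numpy as np

    def active_learn(pool_X, pool_y, budget, seed=0):
        # Pool-based active learning with uncertainty sampling for logistic regression.
        rng = np.random.default_rng(seed)
        labeled = list(rng.choice(len(pool_X), size=2, replace=False))    # tiny random seed set
        w = np.zeros(pool_X.shape[1])
        for _ in range(budget):
            for _ in range(200):                                          # refit on the labeled set
                Xl, yl = pool_X[labeled], pool_y[labeled]
                p = 1.0 / (1.0 + np.exp(-Xl @ w))
                w -= 0.1 * Xl.T @ (p - yl) / len(labeled)
            unlabeled = np.setdiff1d(np.arange(len(pool_X)), labeled)
            margins = np.abs(pool_X[unlabeled] @ w)                       # distance to the boundary
            labeled.append(unlabeled[margins.argmin()])                   # query the most uncertain
        return w, labeled

    rng = np.random.default_rng(5)
    X = rng.normal(size=(1000, 3))
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
    w, queried = active_learn(X, y, budget=20)
    print("labels used:", len(queried), " accuracy:", ((X @ w > 0).astype(float) == y).mean())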
4. Caching
• Fastest approach for (using disk):
  – Nearest-neighbor, 2-point: disk-based tree algorithms in Microsoft SQL Server [Riegel, Aditya, Budavari, Gray, in prep]
    • Builds a kd-tree on top of built-in B-trees
    • Fixed-pass algorithm to build the kd-tree
No. of points    MLDB (dual-tree)    Naive
40,000           8 seconds           159 seconds
200,000          43 seconds          3480 seconds
2,000,000        297 seconds         80 hours
10,000,000       29 min 27 sec       74 days
20,000,000       58 min 48 sec       280 days
40,000,000       112 min 32 sec      2 years
5. Streaming / online
• Fastest approach for (approximate, or
streaming):
– Online learning/stochastic optimization: just use the
current sample to update the gradient
• SVM (squared hinge loss): stochastic Frank-Wolfe [Ouyang and
Gray, SDM 2010]
• SVM, LASSO, et al.: noise-adaptive stochastic
approximation [Ouyang and Gray, in prep, on arxiv], accelerated
non-smooth SGD [Ouyang and Gray, under review]
– faster than SGD
– solves step size problem
– beats all existing convergence rates
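The core of the online/stochastic strategy in one picture: each update touches a single example, so the cost per step is independent of N and the data can stream. A plain SGD sketch for a linear SVM with squared hinge loss (step-size schedule, regularization, and data are arbitrary; this shows the general idea, not the cited stochastic Frank-Wolfe or accelerated methods, which address exactly the step-size and rate issues noted above):

    import numpy as np

    def sgd_squared_hinge(X, y, lam=1e-3, epochs=5, seed=0):
        # Linear SVM with squared hinge loss trained by SGD: one example per update.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(n):
                t += 1
                eta = 0.05 / np.sqrt(1.0 + t / n)            # simple decaying step size
                margin = y[i] * (X[i] @ w)
                grad = lam * w                                # gradient of (lam/2)||w||^2
                if margin < 1:                                # squared hinge: max(0, 1 - margin)^2
                    grad = grad - 2 * (1 - margin) * y[i] * X[i]
                w -= eta * grad
        return w

    rng = np.random.default_rng(6)
    X = rng.normal(size=(5000, 10))
    y = np.sign(X @ rng.normal(size=10))
    w = sgd_squared_hinge(X, y)
    print("training accuracy:", (np.sign(X @ w) == y).mean())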
6. Parallelism
• Fastest approach for (using many machines):
– KDE, GP, n-point: distributed trees [Lee and Gray, SDM 2012], 6000+
cores; [March et al, in prep for Gordon Bell Prize 2012], 100K cores?
• Each process owns the global tree and its local tree
• First log p levels built in parallel; each process determines where to send data
• Asynchronous averaging; provable convergence
– SVM, LASSO, et al.: distributed online optimization [Ouyang and Gray,
in prep, on arxiv]
• Provable theoretical speedup for the first time
[Diagram: the top levels of the global tree and the per-process local trees, distributed across processes P0 through P7.]
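The data-parallel part of this strategy, stripped to its simplest form: partition the reference points across worker processes, let each worker compute a partial result on the chunk it owns, and combine the partials. A multiprocessing sketch of a parallel kernel summation (process count, bandwidth, and data are arbitrary; this is plain data parallelism, not the cited distributed-tree algorithms, which also partition the tree itself):

    import numpy as np
    from multiprocessing import Pool

    def partial_kernel_sum(args):
        # Kernel sum of one query against the chunk of reference points a worker owns.
        q, chunk, h = args
        return np.exp(-((chunk - q) ** 2).sum(axis=1) / (2 * h ** 2)).sum()

    if __name__ == "__main__":
        rng = np.random.default_rng(7)
        X = rng.normal(size=(200000, 3))
        q, h, n_workers = X[0], 1.0, 4
        chunks = np.array_split(X, n_workers)            # each worker owns one slice of the data
        with Pool(n_workers) as pool:
            partials = pool.map(partial_kernel_sum, [(q, c, h) for c in chunks])
        print("parallel:", sum(partials))
        print("serial:  ", np.exp(-((X - q) ** 2).sum(axis=1) / (2 * h ** 2)).sum())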
7. Transformations between problems
• Change the problem type:
  – Linear algebra on kernel matrices → N-body inside conjugate gradient [Gray, TR 2004] (a matrix-free CG sketch follows this slide)
  – Euclidean graphs → N-body problems [March & Gray, KDD 2010]
  – HMM as graph → matrix factorization [Tran & Gray, in prep]
• Optimizations: reformulate the objective and constraints:
  – Maximum variance unfolding: SDP via Burer-Monteiro convex relaxation [Vasiloglou, Gray, Anderson, MLSP 2009]
  – Lq SVM, 0<q<1: DC programming [Guan & Gray, CSDA 2011]
  – L0 SVM: mixed integer nonlinear program via perspective cuts [Guan & Gray, under review]
  – Do reformulations automatically [Agarwal et al., PADL 2010], [Bhat et al., POPL 2012]
• Create new ML methods with desired computational properties:
  – Density estimation trees: nonparametric density estimation, O(N log N) [Ram & Gray, KDD 2011]
  – Local linear SVMs: nonlinear classification, O(N log N) [Sastry & Gray, under review]
  – Discriminative local coding: nonlinear classification, O(N log N) [Mehta & Gray, under review]
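The first transformation above in miniature: solving a linear system in a kernel matrix (as in Gaussian process regression) never requires forming the N x N matrix, because conjugate gradient only needs matrix-vector products, and each product is itself a kernel summation that the N-body machinery can accelerate. A matrix-free CG sketch with a naive kernel matvec standing in for the fast one (kernel, bandwidth, noise level, and data are arbitrary illustrations):

    import numpy as np

    def kernel_matvec(X, v, h, noise=0.1):
        # Compute (K + noise * I) v for the Gaussian kernel without forming K;
        # each row is a kernel summation, i.e. the generalized N-body step.
        out = np.empty_like(v)
        for i in range(len(X)):
            k = np.exp(-((X - X[i]) ** 2).sum(axis=1) / (2 * h ** 2))
            out[i] = k @ v
        return out + noise * v

    def conjugate_gradient(matvec, b, tol=1e-6, max_iter=200):
        # Standard CG for a symmetric positive-definite operator given only as a matvec.
        x = np.zeros_like(b)
        r = b - matvec(x)
        p = r.copy()
        rs = r @ r
        for _ in range(max_iter):
            Ap = matvec(p)
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    rng = np.random.default_rng(8)
    X = rng.normal(size=(1000, 3))
    y = np.sin(X[:, 0])
    alpha = conjugate_gradient(lambda v: kernel_matvec(X, v, h=1.0), y)
    print("residual norm:", np.linalg.norm(kernel_matvec(X, alpha, h=1.0) - y))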
Software
• For academic use only: MLPACK
– Open source, C++, written by students
– Data must fit in RAM; a distributed version is in progress
• For institutions: Skytree Server
– First commercial-grade high-performance machine learning server
– Fastest, biggest ML available: up to 10,000x faster than existing
solutions (on one machine)
– V.12, April 2012-ish: distributed, streaming
– Connects to stats packages, Matlab, DBMS, Python, etc.
– www.skytreecorp.com
– Colleagues: Email me to try it out: [email protected]