#### Transcript ppt - HEA

Fast Algorithms for Analyzing Massive Data
Alexander Gray, Georgia Institute of Technology
www.fast-lab.org

The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory, www.fast-lab.org

1. Alexander Gray: Assoc Prof, Applied Math + CS; PhD CS
2. Arkadas Ozakin: Research Scientist, Math + Physics; PhD Physics
3. Dongryeol Lee: PhD student, CS + Math
4. Ryan Riegel: PhD student, CS + Math
5. Sooraj Bhat: PhD student, CS
6. Nishant Mehta: PhD student, CS
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. Hua Ouyang: PhD student, CS
10. Ravi Sastry: PhD student, CS
11. Long Tran: PhD student, CS
12. Ryan Curtin: PhD student, EE
13. Ailar Javadi: PhD student, EE
14. Anita Zakrzewska: PhD student, CS

+ 5-10 MS students and undergraduates

7 tasks of machine learning / data mining

1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N²)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N²), kernel conditional density estimation O(N³)
3. Classification: decision tree, nearest-neighbor classifier O(N²), kernel discriminant analysis O(N²), support vector machine O(N³), Lp SVM
4. Regression: linear regression, LASSO, kernel regression O(N²), Gaussian process regression O(N³)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³); Gaussian graphical models, discrete graphical models
6. Clustering: k-means, mean-shift O(N²), hierarchical (FoF) clustering O(N³)
7. Testing and matching: MST O(N³), bipartite cross-matching O(N³), n-point correlation 2-sample testing O(Nⁿ), kernel embedding
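The quadratic costs quoted above come from pairwise scans over the data. As a minimal illustration (function name and toy data are made up, not from the talk), here is the brute-force all-nearest-neighbors computation whose O(N²) cost the slide lists:

```python
def all_nearest_neighbors_naive(points):
    """Brute-force all-nearest-neighbors: every point scans every other
    point, which is the O(N^2) cost quoted on the slide."""
    nn = []
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i != j:
                d = sum((a - b) ** 2 for a, b in zip(p, q))  # squared Euclidean distance
                if d < best_d:
                    best_j, best_d = j, d
        nn.append(best_j)
    return nn

# two tight pairs: each point's nearest neighbor is its partner
pairs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
neighbors = all_nearest_neighbors_naive(pairs)
```

The tree-based methods described later in the talk replace the inner scan with pruned search, which is where the O(N) and O(N log N) results come from.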
7 tasks of machine learning / data mining (with FASTlab methods)

1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N²)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N²), kernel conditional density estimation O(N³), submanifold density estimation [Ozakin & Gray, NIPS 2010] O(N³), convex adaptive kernel estimation [Sastry & Gray, AISTATS 2011] O(N⁴)
3. Classification: decision tree, nearest-neighbor classifier O(N²), kernel discriminant analysis O(N²), support vector machine O(N³), Lp SVM, non-negative SVM [Guan et al, 2011]
4. Regression: linear regression, LASSO, kernel regression O(N²), Gaussian process regression O(N³)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³); Gaussian graphical models, discrete graphical models; rank-preserving maps [Ouyang and Gray, ICML 2008] O(N³); isometric separation maps [Vasiloglou, Gray, and Anderson, MLSP 2009] O(N³); isometric NMF [Vasiloglou, Gray, and Anderson, MLSP 2009] O(N³); functional ICA [Mehta and Gray, 2009]; density preserving maps [Ozakin and Gray, in prep] O(N³)
6. Clustering: k-means, mean-shift O(N²),
hierarchical (FoF) clustering O(N³)
7. Testing and matching: MST O(N³), bipartite cross-matching O(N³), n-point correlation 2-sample testing O(Nⁿ), kernel embedding

7 tasks of machine learning / data mining: Computational Problem!
The O(N²) and O(N³) costs in the list above are the computational bottleneck.

The "7 Giants" of Data (computational problem types)
[Gray, Indyk, Mahoney, Szalay, in National Acad of Sci Report on Analysis of Massive Data, in prep]

1. Basic statistics: means, covariances, etc.
2. Generalized N-body problems: distances, geometry
3. Graph-theoretic problems: discrete graphs
4. Linear-algebraic problems: matrix operations
5. Optimizations: unconstrained, convex
6. Integrations: general dimension
7. Alignment problems: dynamic prog, matching

7 general strategies

1. Divide and conquer / indexing (trees)
2. Function transforms (series)
3. Sampling (Monte Carlo, active learning)
4. Locality (caching)
5. Streaming (online)
6. Parallelism (clusters, GPUs)
7. Problem transformation (reformulations)
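Strategy 1 (divide and conquer with trees) can be made concrete with a small example: a textbook kd-tree with pruned nearest-neighbor search, in the spirit of [Bentley 1970]. This is a generic sketch, not the FASTlab implementation; all names are illustrative:

```python
import math

def build_kdtree(points, depth=0):
    # Divide and conquer: split on the median along one coordinate,
    # then recurse on each half.
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    # Depth-first search; prune a subtree whenever the splitting plane
    # is farther away than the current best distance.
    if node is None:
        return best
    d = math.dist(query, node["point"])
    if best is None or d < best[1]:
        best = (node["point"], d)
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if abs(diff) < best[1]:   # can the far side still contain a closer point?
        best = nearest(far, query, best)
    return best
```

With balanced splits, a single query visits ~O(log N) nodes instead of scanning all N points; the dual-tree algorithms cited below extend the same pruning idea to all N queries at once.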
1. Divide and conquer

Fastest approach for:
- nearest neighbor, range search (exact) ~O(log N) [Bentley 1970]; all-nearest-neighbors (exact) O(N) [Gray & Moore, NIPS 2000], [Ram, Lee, March, Gray, NIPS 2010]; anytime nearest neighbor (exact) [Ram & Gray, SDM 2012]; max inner product [Ram & Gray, under review]
- mixture of Gaussians [Moore, NIPS 1999]; k-means [Pelleg and Moore, KDD 1999]; mean-shift clustering O(N) [Lee & Gray, AISTATS 2009]; hierarchical clustering (single linkage, friends-of-friends) O(N log N) [March & Gray, KDD 2010]
- nearest neighbor classification [Liu, Moore, Gray, NIPS 2004]; kernel discriminant analysis O(N) [Riegel & Gray, SDM 2008]
- n-point correlation functions ~O(N^(log₂ n)) [Gray & Moore, NIPS 2000], [Moore et al., Mining the Sky 2000]; multi-matcher jackknifed npcf [March & Gray, under review]

3-point correlation (biggest previous: 20K points)

VIRGO simulation data, N = 75,000,000
naive: 5×10⁹ sec (~150 years); multi-tree: 55 sec (exact)
Scaling in n: n=2: O(N); n=3: O(N^(log₂ 3)); n=4: O(N²)

n-point correlation timings, galaxy simulation data:

| | Naive O(Nⁿ) (estimated) | Single bandwidth [Gray & Moore 2000, Moore et al. 2000] | Speedup | Multi-bandwidth (new) [March & Gray, in prep 2010] | Speedup |
| --- | --- | --- | --- | --- | --- |
| 2-point cor. (100 matchers) | 2.0×10⁷ s | 352.8 s | 56,000 | 4.96 s | 71.1 |
| 3-point cor. (243 matchers) | 1.1×10¹¹ s | 891.6 s | 1.23×10⁸ | 13.58 s | 65.6 |
| 4-point cor. (21610 matchers) | 2.3×10¹⁴ s | 14530 s | 1.58×10¹⁰ | 503.6 s | 28.8 |

2. Function transforms

Fastest approach for:
- Kernel estimation (low-ish dimension): dual-tree fast Gauss transforms (multipole/Hermite expansions) [Lee, Gray, Moore, NIPS 2005], [Lee and Gray, UAI 2006]
- KDE and GP (kernel density estimation, Gaussian process regression) (high-D): random Fourier functions [Lee and Gray, in prep]
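The "random Fourier functions" idea can be sketched with the standard random-features construction for the Gaussian kernel (the generic Rahimi-Recht-style recipe, not the cited [Lee and Gray] method; all names and parameters here are illustrative):

```python
import math, random

def make_features(d, D, sigma=1.0, seed=0):
    # Frequencies drawn from the Gaussian kernel's Fourier transform,
    # phases uniform on [0, 2*pi): the standard random-features recipe.
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    phases = [rng.uniform(0.0, 2 * math.pi) for _ in range(D)]
    def z(x):
        # z(x).z(y) approximates exp(-||x - y||^2 / (2 sigma^2))
        return [math.sqrt(2.0 / D) * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]
    return z

z = make_features(d=2, D=2000)
zx, zy = z([0.0, 0.0]), z([0.5, 0.0])
approx = sum(a * b for a, b in zip(zx, zy))   # ~ exp(-0.25 / 2)
```

The point computationally: the N×N kernel matrix never needs to be formed; an N×D feature matrix (with error shrinking like 1/√D) stands in for it, which is what makes the high-dimensional KDE and GP cases tractable.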
3. Sampling

Fastest approach for (approximate):
- PCA: cosine trees [Holmes, Gray, Isbell, NIPS 2008]
- Kernel estimation: bandwidth learning [Holmes, Gray, Isbell, NIPS 2006], [Holmes, Gray, Isbell, UAI 2007]; Monte Carlo multipole method (with SVD trees) [Lee & Gray, NIPS 2009]
- Nearest-neighbor: distance-approximate: spill trees with random projections [Liu, Moore, Gray, Yang, NIPS 2004]; rank-approximate: [Ram, Ouyang, Gray, NIPS 2009]

Rank-approximate NN:
- Best meaning-retaining approximation criterion in the face of high-dimensional distances
- More accurate than LSH

[Figure: speedup over linear search (10⁰ to 10⁴) on bio, corel, covtype, images, mnist, phy, urand, for e = 0% (exact), 0.001%, 0.01%, 0.1%, 1%, 10%; a = 0.95]

3. Sampling (continued)

- Active learning: the sampling can depend on previous samples
  - Linear classifiers: rigorous framework for pool-based active learning [Sastry and Gray, AISTATS 2012]
  - Empirically allows reduction in the number of objects that require labeling
  - Theoretical rigor: unbiasedness

[Figure: learning curves comparing UPAL, BMAL, VW, RAL, PL]

4. Caching

Fastest approach for (using disk):
- Nearest-neighbor, 2-point: disk-based tree algorithms in Microsoft SQL Server [Riegel, Aditya, Budavari, Gray, in prep]
  - Builds kd-tree on top of built-in B-trees
  - Fixed-pass algorithm to build kd-tree

| No. of points | MLDB (dual-tree) | Naive |
| --- | --- | --- |
| 40,000 | 8 seconds | 159 seconds |
| 200,000 | 43 seconds | 3480 seconds |
| 2,000,000 | 297 seconds | 80 hours |
| 10,000,000 | 29 min 27 sec | 74 days |
| 20,000,000 | 58 min 48 sec | 280 days |
| 40,000,000 | 112 min 32 sec | 2 years |
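As a generic illustration of the sampling strategy (not any of the specific cited algorithms), here is a Monte Carlo estimate of a kernel density value from a random subset of reference points; function names and data are made up:

```python
import math, random

def kde_at(q, data, h):
    # Exact 1-D Gaussian kernel density estimate at q: O(N) per query,
    # hence O(N^2) over all N query points.
    norm = h * math.sqrt(2 * math.pi)
    return sum(math.exp(-(q - x) ** 2 / (2 * h * h)) for x in data) / (len(data) * norm)

def kde_at_sampled(q, data, h, m, seed=0):
    # Monte Carlo estimate from m random reference points: unbiased,
    # with error shrinking like 1/sqrt(m) regardless of the dataset size N.
    sample = random.Random(seed).sample(data, m)
    norm = h * math.sqrt(2 * math.pi)
    return sum(math.exp(-(q - x) ** 2 / (2 * h * h)) for x in sample) / (m * norm)

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]
exact = kde_at(0.0, data, h=0.3)
approx = kde_at_sampled(0.0, data, h=0.3, m=2000)
```

The cited methods go further by steering the sampling with tree structure or with previously seen samples (active learning), rather than sampling uniformly as above.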
5. Streaming / online

Fastest approach for (approximate, or streaming):
- Online learning / stochastic optimization: just use the current sample to update the gradient
  - SVM (squared hinge loss): stochastic Frank-Wolfe [Ouyang and Gray, SDM 2010]
  - SVM, LASSO, et al.: noise-adaptive stochastic approximation [Ouyang and Gray, in prep, on arXiv]; accelerated non-smooth SGD [Ouyang and Gray, under review]
    - faster than SGD
    - solves the step-size problem
    - beats all existing convergence rates

6. Parallelism

Fastest approach for (using many machines):
- KDE, GP, n-point: distributed trees [Lee and Gray, SDM 2012], 6000+ cores; [March et al., in prep for Gordon Bell Prize 2012], 100K cores?
  - Each process owns the global tree and its local tree
  - First log p levels built in parallel; each process determines where to send data
  - Asynchronous averaging; provable convergence
- SVM, LASSO, et al.: distributed online optimization [Ouyang and Gray, in prep, on arXiv]
  - Provable theoretical speedup for the first time

[Figure: tree levels distributed across processes P0-P7]
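The "use the current sample to update the gradient" idea can be sketched with plain Pegasos-style stochastic subgradient descent on the hinge-loss SVM objective. This is the generic baseline, not the cited Frank-Wolfe or accelerated methods; the data and parameters are toy:

```python
import random

def sgd_svm(data, labels, lam=0.01, epochs=50, seed=0):
    # Pegasos-style stochastic subgradient descent for a linear SVM:
    # each update touches a single random sample, so the per-step cost
    # is independent of N, and the data can arrive as a stream.
    rng = random.Random(seed)
    w = [0.0] * len(data[0])
    t = 0
    for _ in range(epochs):
        for _ in range(len(data)):
            t += 1
            i = rng.randrange(len(data))
            x, y = data[i], labels[i]
            eta = 1.0 / (lam * t)                      # decaying step size
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            w = [(1.0 - eta * lam) * wj for wj in w]   # subgradient of the L2 regularizer
            if margin < 1:                             # hinge loss active: add eta * y * x
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

# toy linearly separable problem (made-up data)
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
Y = [1, 1, -1, -1]
w = sgd_svm(X, Y)
```

The 1/(lam*t) schedule is the classic step-size choice whose sensitivity the "solves the step-size problem" bullet refers to; the cited noise-adaptive methods remove the need to tune it.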
7. Transformations between problems

Change the problem type:
- Linear algebra on kernel matrices → N-body inside conjugate gradient [Gray, TR 2004]
- Euclidean graphs → N-body problems [March & Gray, KDD 2010]
- HMM as graph → matrix factorization [Tran & Gray, in prep]

Optimizations: reformulate the objective and constraints:
- Maximum variance unfolding: SDP via Burer-Monteiro convex relaxation [Vasiloglou, Gray, Anderson, MLSP 2009]
- Lq SVM, 0<q<1: DC programming [Guan & Gray, CSDA 2011]
- L0 SVM: mixed integer nonlinear program via perspective cuts [Guan & Gray, under review]
- Do reformulations automatically [Agarwal et al., PADL 2010], [Bhat et al., POPL 2012]

Create new ML methods with desired computational properties:
- Density estimation trees: nonparametric density estimation, O(N log N) [Ram & Gray, KDD 2011]
- Local linear SVMs: nonlinear classification, O(N log N) [Sastry & Gray, under review]
- Discriminative local coding: nonlinear classification, O(N log N) [Mehta & Gray, under review]

Software

For academic use only: MLPACK
- Open source, C++, written by students
- Data must fit in RAM; distributed version in progress

For institutions: Skytree Server
- First commercial-grade high-performance machine learning server
- Fastest, biggest ML available: up to 10,000x faster than existing solutions (on one machine)
- V.12, April 2012-ish: distributed, streaming
- Connects to stats packages, Matlab, DBMS, Python, etc.
- www.skytreecorp.com
- Colleagues: Email me to try it out: [email protected]