Large-scale Single-pass k-Means Clustering
©MapR Technologies - Confidential
Goals
Cluster very large data sets
Facilitate large nearest-neighbor searches
Allow a very large number of clusters
Achieve good quality
– low average distance to nearest centroid on held-out data
Based on Mahout Math
Runs on a Hadoop (really MapR) cluster
FAST – clusters tens of millions of points in minutes
Non-goals
Use map-reduce (though a map-reduce version exists)
Minimize the number of clusters
Support metrics other than L2
Anti-goals
Multiple passes over the original data
Scaling as O(k n)
Why?
K-nearest Neighbor with Super-fast k-means
What’s that?
Find the k nearest training examples
Use the average value of their target variable as the prediction
This is easy … but hard
– easy because it is so conceptually simple: no knobs to turn, no models to build
– hard because of the stunning amount of math
– also hard because we need the top 50,000 results
Initial prototype was massively too slow
– 3K queries x 200K examples took hours
– needed 20M x 25M in the same time
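The baseline being described can be written in a few lines. This is a brute-force sketch with illustrative names (not code from the project): it is exactly the conceptually-simple-but-slow approach the rest of the talk is about accelerating.

```java
import java.util.Arrays;
import java.util.Comparator;

// Brute-force k-NN regression: find the k nearest training examples and
// average their target values. The full sort by distance is the slow part.
public class KnnRegressionSketch {
    public static double predict(double[][] examples, double[] targets, double[] query, int k) {
        Integer[] idx = new Integer[examples.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // rank every training example by distance to the query (O(n log n) per query)
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(examples[i], query)));
        double sum = 0;
        for (int i = 0; i < k; i++) sum += targets[idx[i]];
        return sum / k;  // average target value of the k nearest examples
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```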
How We Did It
2-week hackathon with 6 developers from a customer bank
Agile-ish development
To avoid IP issues
– all code is Apache licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosted on Github
– open is easier than closed (in this case)
Goal is new open technology to facilitate new closed solutions
Ambitious goal of ~1,000,000x speedup
– well, really only 100–1000x after basic hygiene
What We Did
Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices
Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
Super-fast clustering
– Kmeans, StreamingKmeans
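The mmap-shared-matrix idea can be sketched with standard Java NIO. This is not the FileBasedMatrix code itself, just a minimal illustration of the mechanism: map the backing file once and view it as a row-major buffer of doubles, so multiple processes mapping the same file share the data. Note that a single Java mapping is limited to 2 GB, so a truly large matrix needs one mapping per chunk; one mapping is shown here for brevity.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of sharing a dense matrix via mmap. Names are illustrative.
public class MmapMatrixSketch {
    public static DoubleBuffer map(File f, int rows, int cols) throws IOException {
        long bytes = (long) rows * cols * Double.BYTES;
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(bytes);
            // the mapping stays valid after the channel is closed
            MappedByteBuffer mb = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, bytes);
            return mb.asDoubleBuffer();  // view the raw mapping as row-major doubles
        }
    }

    public static double get(DoubleBuffer m, int cols, int r, int c) {
        return m.get(r * cols + c);
    }

    public static void set(DoubleBuffer m, int cols, int r, int c, double v) {
        m.put(r * cols + c, v);
    }
}
```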
Projection Search
java.lang.TreeSet!
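The TreeSet reference is the whole trick: project every vector onto a few random unit directions, keep each set of scalar projections in an ordered structure, and answer a query by checking only entries whose projection lies near the query's. A minimal sketch (class and method names are mine, not the Mahout ProjectionSearch), using TreeMap so each projection key carries its vector:

```java
import java.util.*;

// Minimal projection search: vectors are indexed by their scalar projection
// onto random unit directions; queries scan a narrow window per direction.
public class ProjectionIndex {
    private final double[][] directions;                  // random unit vectors
    private final List<TreeMap<Double, double[]>> index;  // one ordered map per direction
    private final int searchWidth;                        // candidates per side, per direction

    public ProjectionIndex(int dim, int numProjections, int searchWidth, Random rnd) {
        this.searchWidth = searchWidth;
        directions = new double[numProjections][dim];
        index = new ArrayList<>();
        for (int p = 0; p < numProjections; p++) {
            double norm = 0;
            for (int j = 0; j < dim; j++) {
                directions[p][j] = rnd.nextGaussian();
                norm += directions[p][j] * directions[p][j];
            }
            for (int j = 0; j < dim; j++) directions[p][j] /= Math.sqrt(norm);
            index.add(new TreeMap<>());
        }
    }

    public void add(double[] v) {
        for (int p = 0; p < directions.length; p++) {
            index.get(p).put(dot(directions[p], v), v);  // key ties overwrite; fine for a sketch
        }
    }

    /** Approximate nearest neighbor: true distances computed only for candidates. */
    public double[] searchNearest(double[] q) {
        double best = Double.POSITIVE_INFINITY;
        double[] bestV = null;
        for (int p = 0; p < directions.length; p++) {
            double key = dot(directions[p], q);
            Iterator<double[]> up = index.get(p).tailMap(key, true).values().iterator();
            Iterator<double[]> down = index.get(p).headMap(key, false).descendingMap().values().iterator();
            for (int s = 0; s < searchWidth; s++) {
                for (Iterator<double[]> it : Arrays.asList(up, down)) {
                    if (it.hasNext()) {
                        double[] v = it.next();
                        double d = distance(v, q);
                        if (d < best) { best = d; bestV = v; }
                    }
                }
            }
        }
        return bestV;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += a[j] * b[j];
        return s;
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }
}
```

More projections and a wider search window trade speed for recall, which is exactly the question the next slide's figure addresses.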
How Many Projections?
K-means Search
Simple idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
Recursive application
– to search a cluster, use a Searcher!
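The pre-cluster-then-probe idea can be written out directly. This sketch uses my own names (not the actual KmeansSearch class); `probes` controls how many of the nearest clusters get scanned:

```java
import java.util.*;

// Sketch of k-means search: data is pre-grouped around centroids, and a query
// only brute-forces the few groups whose centroids are closest to it.
public class ClusterPrunedSearch {
    public static double[] nearest(double[] q, double[][] centroids,
                                   List<List<double[]>> groups, int probes) {
        // rank clusters by how close their centroid is to the query
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(centroids[i], q)));

        double best = Double.POSITIVE_INFINITY;
        double[] bestV = null;
        for (int p = 0; p < Math.min(probes, order.length); p++) {
            for (double[] v : groups.get(order[p])) {  // brute force inside each probed cluster
                double d = distance(v, q);
                if (d < best) { best = d; bestV = v; }
            }
        }
        return bestV;
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```

The recursive application is the inner loop: replace the brute-force scan of a probed cluster with another Searcher and the same structure nests.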
But This Requires k-means!
Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– maybe Pregel clones like Giraph would be better
– or maybe not
Streaming k-means is
– one pass (through the original data)
– very fast (20 μs per data point with threads)
– very parallelizable
Basic Method
Use a single pass of k-means with very many clusters
– output is a bad-ish clustering, but a good surrogate for the data
Use the weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters
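Step 2 is ordinary k-means except that every "point" is a centroid from the streaming pass carrying the weight of the raw data it absorbed. A sketch of a weighted Lloyd iteration (the function and its signature are mine, not the Mahout code):

```java
// Weighted Lloyd's iterations over the surrogate: assign each weighted point
// to its nearest center, then recompute centers as weighted means.
public class WeightedKMeansSketch {
    public static double[][] lloyd(double[][] pts, double[] weights, double[][] init, int iters) {
        int k = init.length, dim = pts[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = init[c].clone();
        for (int it = 0; it < iters; it++) {
            double[][] sum = new double[k][dim];
            double[] wsum = new double[k];
            for (int i = 0; i < pts.length; i++) {
                int c = nearest(centers, pts[i]);          // assignment step
                for (int j = 0; j < dim; j++) sum[c][j] += weights[i] * pts[i][j];
                wsum[c] += weights[i];
            }
            for (int c = 0; c < k; c++)                    // weighted mean update
                if (wsum[c] > 0)
                    for (int j = 0; j < dim; j++) centers[c][j] = sum[c][j] / wsum[c];
        }
        return centers;
    }

    static int nearest(double[][] centers, double[] x) {
        int best = 0;
        double bd = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d = 0;
            for (int j = 0; j < x.length; j++) d += (centers[c][j] - x[j]) * (centers[c][j] - x[j]);
            if (d < bd) { bd = d; best = c; }
        }
        return best;
    }
}
```

Because the surrogate has only thousands of weighted centroids rather than millions of raw points, this step fits in memory and is cheap.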
Algorithmic Details
For each data point xn:
– compute the distance to the nearest centroid, ∂
– sample u ~ U(0,1); if u > ∂/ß, add the point to the nearest centroid
– otherwise, create a new centroid
If the number of centroids > 10 log n:
– recursively cluster the centroids
– set ß = 1.5 ß if the number of centroids did not decrease
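The loop above, written out as a sketch (class and method names are mine, not the actual StreamingKmeans; ∂ appears as a distance and ß as `beta`). Points either fold into the nearest centroid or start a new one, with the choice driven by distance, and an overgrown centroid set is recursively re-clustered:

```java
import java.util.*;

// Sketch of the single-pass streaming k-means step described above.
public class StreamingStep {
    public static class Centroid {
        public double[] mean;
        public double weight;
        Centroid(double[] x, double w) { mean = x.clone(); weight = w; }
        void merge(double[] x, double w) {  // fold in a point or another centroid
            for (int j = 0; j < mean.length; j++)
                mean[j] = (mean[j] * weight + x[j] * w) / (weight + w);
            weight += w;
        }
    }

    public final List<Centroid> centroids = new ArrayList<>();
    private double beta;   // distance scale; small distances favor merging
    private long n = 0;
    private final Random rnd;

    public StreamingStep(double beta, Random rnd) { this.beta = beta; this.rnd = rnd; }

    public void add(double[] x) {
        n++;
        Centroid near = nearest(x);
        // u > d/beta: add to the nearest centroid; otherwise create a new one
        if (near != null && rnd.nextDouble() > distance(near.mean, x) / beta) near.merge(x, 1);
        else centroids.add(new Centroid(x, 1));
        if (centroids.size() > 10 * Math.log(n + 1)) collapse();
    }

    // Recursively cluster the centroids; grow beta if that failed to shrink them.
    private void collapse() {
        int before = centroids.size();
        List<Centroid> old = new ArrayList<>(centroids);
        centroids.clear();
        for (Centroid c : old) {
            Centroid near = nearest(c.mean);
            if (near != null && rnd.nextDouble() > distance(near.mean, c.mean) / beta)
                near.merge(c.mean, c.weight);
            else centroids.add(c);
        }
        if (centroids.size() >= before) beta *= 1.5;
    }

    private Centroid nearest(double[] x) {
        Centroid best = null;
        double bd = Double.POSITIVE_INFINITY;
        for (Centroid c : centroids) {
            double d = distance(c.mean, x);
            if (d < bd) { bd = d; best = c; }
        }
        return best;
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }
}
```

The brute-force nearest-centroid lookup shown here is the inner loop that the searcher classes accelerate in practice.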
How It Works
Result is a large set of centroids
– these provide an approximation of the original distribution
– we can cluster the centroids to get a close approximation of clustering the original data
– or we can just use the result directly
Parallel Speedup?
[Figure: time per data point (μs) versus number of threads, comparing the non-threaded version, the threaded version with 2–16 threads, and perfect scaling]
Warning, Recursive Descent
Inner loop requires finding the nearest centroid
With lots of centroids, this is slow
But wait, we have classes to accelerate that!
(Let’s not use the k-means searcher, though)
Empirically, projection search beats 64-bit LSH by a bit
Moving to Scale
Map-reduce implementation is nearly trivial
Map: rough-cluster the input data; output ß and the weighted centroids
Reduce:
– a single reducer gets all the centroids
– if there are too many centroids, merge them using recursive clustering
– optionally do the final clustering in memory
A combiner is possible, but essentially never important
Contact:
– [email protected]
– @ted_dunning
Slides and such:
– http://info.mapr.com/ted-mlconf.html
Hash tags: #mlconf #mahout #mapr