Large-scale Single-pass k-Means Clustering
©MapR Technologies - Confidential
Goals
Cluster very large data sets
Facilitate large nearest-neighbor searches
Allow a very large number of clusters
Achieve good quality
– low average distance to nearest centroid on held-out data
Based on Mahout Math
Runs on a Hadoop (really MapR) cluster
FAST – clusters tens of millions of points in minutes
Non-goals
Use map-reduce (though a map-reduce version exists)
Minimize the number of clusters
Support metrics other than L2
Anti-goals
Multiple passes over the original data
Scaling as O(k n)
Why?
K-nearest Neighbor with Super-fast k-means
What’s that?
Find the k nearest training examples
Use the average value of their target variable as the prediction
This is easy … but hard
– easy because it is so conceptually simple: no knobs to turn, no models to build
– hard because of the stunning amount of math
– also hard because we need the top 50,000 results
Initial prototype was massively too slow
– 3K queries x 200K examples took hours
– needed 20M x 25M in the same time
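The baseline being described can be written in a few lines. This is a brute-force sketch with illustrative names (not code from the project): it is exactly the conceptually-simple-but-slow approach the rest of the talk is about accelerating.

```java
import java.util.Arrays;
import java.util.Comparator;

// Brute-force k-NN regression: find the k nearest training examples and
// average their target values. The full sort by distance is the slow part.
public class KnnRegressionSketch {
    public static double predict(double[][] examples, double[] targets, double[] query, int k) {
        Integer[] idx = new Integer[examples.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // rank every training example by distance to the query (O(n log n) per query)
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(examples[i], query)));
        double sum = 0;
        for (int i = 0; i < k; i++) sum += targets[idx[i]];
        return sum / k;  // average target value of the k nearest examples
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```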
How We Did It
2-week hackathon with 6 developers from a customer bank
Agile-ish development
To avoid IP issues
– all code is Apache licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosted on Github
– open is easier than closed (in this case)
Goal is new open technology to facilitate new closed solutions
Ambitious goal of ~1,000,000x speedup
– well, really only 100–1000x after basic hygiene
What We Did
Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices
Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
Super-fast clustering
– Kmeans, StreamingKmeans
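The mmap-shared-matrix idea can be sketched with standard Java NIO. This is not the FileBasedMatrix code itself, just a minimal illustration of the mechanism: map the backing file once and view it as a row-major buffer of doubles, so multiple processes mapping the same file share the data. Note that a single Java mapping is limited to 2 GB, so a truly large matrix needs one mapping per chunk; one mapping is shown here for brevity.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of sharing a dense matrix via mmap. Names are illustrative.
public class MmapMatrixSketch {
    public static DoubleBuffer map(File f, int rows, int cols) throws IOException {
        long bytes = (long) rows * cols * Double.BYTES;
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(bytes);
            // the mapping stays valid after the channel is closed
            MappedByteBuffer mb = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, bytes);
            return mb.asDoubleBuffer();  // view the raw mapping as row-major doubles
        }
    }

    public static double get(DoubleBuffer m, int cols, int r, int c) {
        return m.get(r * cols + c);
    }

    public static void set(DoubleBuffer m, int cols, int r, int c, double v) {
        m.put(r * cols + c, v);
    }
}
```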
Projection Search
java.lang.TreeSet!
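The TreeSet reference is the whole trick: project every vector onto a few random unit directions, keep each set of scalar projections in an ordered structure, and answer a query by checking only entries whose projection lies near the query's. A minimal sketch (class and method names are mine, not the Mahout ProjectionSearch), using TreeMap so each projection key carries its vector:

```java
import java.util.*;

// Minimal projection search: vectors are indexed by their scalar projection
// onto random unit directions; queries scan a narrow window per direction.
public class ProjectionIndex {
    private final double[][] directions;                  // random unit vectors
    private final List<TreeMap<Double, double[]>> index;  // one ordered map per direction
    private final int searchWidth;                        // candidates per side, per direction

    public ProjectionIndex(int dim, int numProjections, int searchWidth, Random rnd) {
        this.searchWidth = searchWidth;
        directions = new double[numProjections][dim];
        index = new ArrayList<>();
        for (int p = 0; p < numProjections; p++) {
            double norm = 0;
            for (int j = 0; j < dim; j++) {
                directions[p][j] = rnd.nextGaussian();
                norm += directions[p][j] * directions[p][j];
            }
            for (int j = 0; j < dim; j++) directions[p][j] /= Math.sqrt(norm);
            index.add(new TreeMap<>());
        }
    }

    public void add(double[] v) {
        for (int p = 0; p < directions.length; p++) {
            index.get(p).put(dot(directions[p], v), v);  // key ties overwrite; fine for a sketch
        }
    }

    /** Approximate nearest neighbor: true distances computed only for candidates. */
    public double[] searchNearest(double[] q) {
        double best = Double.POSITIVE_INFINITY;
        double[] bestV = null;
        for (int p = 0; p < directions.length; p++) {
            double key = dot(directions[p], q);
            Iterator<double[]> up = index.get(p).tailMap(key, true).values().iterator();
            Iterator<double[]> down = index.get(p).headMap(key, false).descendingMap().values().iterator();
            for (int s = 0; s < searchWidth; s++) {
                for (Iterator<double[]> it : Arrays.asList(up, down)) {
                    if (it.hasNext()) {
                        double[] v = it.next();
                        double d = distance(v, q);
                        if (d < best) { best = d; bestV = v; }
                    }
                }
            }
        }
        return bestV;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += a[j] * b[j];
        return s;
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }
}
```

More projections and a wider search window trade speed for recall, which is exactly the question the next slide's figure addresses.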
How Many Projections?
K-means Search
Simple idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
Recursive application
– to search a cluster, use a Searcher!
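The pre-cluster-then-probe idea can be written out directly. This sketch uses my own names (not the actual KmeansSearch class); `probes` controls how many of the nearest clusters get scanned:

```java
import java.util.*;

// Sketch of k-means search: data is pre-grouped around centroids, and a query
// only brute-forces the few groups whose centroids are closest to it.
public class ClusterPrunedSearch {
    public static double[] nearest(double[] q, double[][] centroids,
                                   List<List<double[]>> groups, int probes) {
        // rank clusters by how close their centroid is to the query
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(centroids[i], q)));

        double best = Double.POSITIVE_INFINITY;
        double[] bestV = null;
        for (int p = 0; p < Math.min(probes, order.length); p++) {
            for (double[] v : groups.get(order[p])) {  // brute force inside each probed cluster
                double d = distance(v, q);
                if (d < best) { best = d; bestV = v; }
            }
        }
        return bestV;
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```

The recursive application is the inner loop: replace the brute-force scan of a probed cluster with another Searcher and the same structure nests.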
But This Requires k-means!
Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– maybe Pregel clones like Giraph would be better
– or maybe not
Streaming k-means is
– one pass (through the original data)
– very fast (20 μs per data point with threads)
– very parallelizable
Basic Method
Use a single pass of k-means with very many clusters
– output is a bad-ish clustering, but a good surrogate for the data
Use the weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters
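Step 2 is ordinary k-means except that every "point" is a centroid from the streaming pass carrying the weight of the raw data it absorbed. A sketch of a weighted Lloyd iteration (the function and its signature are mine, not the Mahout code):

```java
// Weighted Lloyd's iterations over the surrogate: assign each weighted point
// to its nearest center, then recompute centers as weighted means.
public class WeightedKMeansSketch {
    public static double[][] lloyd(double[][] pts, double[] weights, double[][] init, int iters) {
        int k = init.length, dim = pts[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = init[c].clone();
        for (int it = 0; it < iters; it++) {
            double[][] sum = new double[k][dim];
            double[] wsum = new double[k];
            for (int i = 0; i < pts.length; i++) {
                int c = nearest(centers, pts[i]);          // assignment step
                for (int j = 0; j < dim; j++) sum[c][j] += weights[i] * pts[i][j];
                wsum[c] += weights[i];
            }
            for (int c = 0; c < k; c++)                    // weighted mean update
                if (wsum[c] > 0)
                    for (int j = 0; j < dim; j++) centers[c][j] = sum[c][j] / wsum[c];
        }
        return centers;
    }

    static int nearest(double[][] centers, double[] x) {
        int best = 0;
        double bd = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d = 0;
            for (int j = 0; j < x.length; j++) d += (centers[c][j] - x[j]) * (centers[c][j] - x[j]);
            if (d < bd) { bd = d; best = c; }
        }
        return best;
    }
}
```

Because the surrogate has only thousands of weighted centroids rather than millions of raw points, this step fits in memory and is cheap.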
Algorithmic Details
For each data point xn:
– compute the distance to the nearest centroid, ∂
– sample u ~ U(0,1); if u > ∂/ß, add the point to the nearest centroid
– otherwise, create a new centroid
If the number of centroids > 10 log n:
– recursively cluster the centroids
– set ß = 1.5 ß if the number of centroids did not decrease
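The loop above, written out as a sketch (class and method names are mine, not the actual StreamingKmeans; ∂ appears as a distance and ß as `beta`). Points either fold into the nearest centroid or start a new one, with the choice driven by distance, and an overgrown centroid set is recursively re-clustered:

```java
import java.util.*;

// Sketch of the single-pass streaming k-means step described above.
public class StreamingStep {
    public static class Centroid {
        public double[] mean;
        public double weight;
        Centroid(double[] x, double w) { mean = x.clone(); weight = w; }
        void merge(double[] x, double w) {  // fold in a point or another centroid
            for (int j = 0; j < mean.length; j++)
                mean[j] = (mean[j] * weight + x[j] * w) / (weight + w);
            weight += w;
        }
    }

    public final List<Centroid> centroids = new ArrayList<>();
    private double beta;   // distance scale; small distances favor merging
    private long n = 0;
    private final Random rnd;

    public StreamingStep(double beta, Random rnd) { this.beta = beta; this.rnd = rnd; }

    public void add(double[] x) {
        n++;
        Centroid near = nearest(x);
        // u > d/beta: add to the nearest centroid; otherwise create a new one
        if (near != null && rnd.nextDouble() > distance(near.mean, x) / beta) near.merge(x, 1);
        else centroids.add(new Centroid(x, 1));
        if (centroids.size() > 10 * Math.log(n + 1)) collapse();
    }

    // Recursively cluster the centroids; grow beta if that failed to shrink them.
    private void collapse() {
        int before = centroids.size();
        List<Centroid> old = new ArrayList<>(centroids);
        centroids.clear();
        for (Centroid c : old) {
            Centroid near = nearest(c.mean);
            if (near != null && rnd.nextDouble() > distance(near.mean, c.mean) / beta)
                near.merge(c.mean, c.weight);
            else centroids.add(c);
        }
        if (centroids.size() >= before) beta *= 1.5;
    }

    private Centroid nearest(double[] x) {
        Centroid best = null;
        double bd = Double.POSITIVE_INFINITY;
        for (Centroid c : centroids) {
            double d = distance(c.mean, x);
            if (d < bd) { bd = d; best = c; }
        }
        return best;
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }
}
```

The brute-force nearest-centroid lookup shown here is the inner loop that the searcher classes accelerate in practice.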
How It Works
Result is a large set of centroids
– these provide an approximation of the original distribution
– we can cluster the centroids to get a close approximation of clustering the original data
– or we can just use the result directly
Parallel Speedup?
[Figure: time per data point (μs) versus number of threads, comparing the non-threaded version, the threaded version with 2–16 threads, and perfect scaling]
Warning, Recursive Descent
Inner loop requires finding the nearest centroid
With lots of centroids, this is slow
But wait, we have classes to accelerate that!
(Let’s not use the k-means searcher, though)
Empirically, projection search beats 64-bit LSH by a bit
Moving to Scale
Map-reduce implementation is nearly trivial
Map: rough-cluster the input data; output ß and the weighted centroids
Reduce:
– a single reducer gets all the centroids
– if there are too many centroids, merge them using recursive clustering
– optionally do the final clustering in memory
A combiner is possible, but essentially never important
Contact:
– [email protected]
– @ted_dunning
Slides and such:
– http://info.mapr.com/ted-mlconf.html
Hash tags: #mlconf #mahout #mapr