
Mr. Scan: Efficient Clustering with
MRNet and GPUs
Evan Samanas and Ben Welton
Paradyn Project
Paradyn / Dyninst Week
Madison, Wisconsin
April 29-May 3, 2013
Density-based clustering
o Discovers the number of clusters
o Finds oddly-shaped clusters
Clustering Example (DBSCAN[1])
Goal: Find regions that meet minimum density and spatial distance characteristics
The two parameters that determine whether a point is in a cluster are Epsilon (Eps) and MinPts.
If the number of points within Eps of a point is > MinPts, the point is a core point.
For every discovered point, this same calculation is performed until the cluster is fully expanded.
[Figure: example point set with an Eps-radius neighborhood drawn around a point; MinPts: 3]
[1] M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise (1996)
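The procedure on this slide can be sketched as a minimal single-node DBSCAN. This is an illustrative sketch, not Mr. Scan's code: it uses a brute-force O(n²) neighbor search, and it treats a point as core when its Eps-neighborhood, including the point itself, holds at least MinPts points (the convention of [1]); shift the comparison by one to match a strict "> MinPts" reading. The function names are hypothetical.

```python
from collections import deque
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:  # not core; may later become a border point
            labels[i] = NOISE
            continue
        labels[i] = cluster  # i is a core point that starts a new cluster
        queue = deque(neighbors)
        while queue:  # repeat the same test for every discovered point
            j = queue.popleft()
            if labels[j] == NOISE:  # reachable non-core point: border of this cluster
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is core too: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels
```

For example, `dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)], 1.5, 3)` returns `[0, 0, 0, 0, 1, 1, 1, -1]`: two clusters are discovered without being told how many to look for, and the isolated point is noise.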
Scaling DBSCAN
o PDBSCAN (1999) [2]
  o Quality equivalent to single-node DBSCAN
  o Linear speedup up to 8 nodes
o DBDC (2004) [3]
  o Sacrifices quality
  o ~30x speedup on 15 nodes
o PDSDBSCAN (2012) [4]
  o Quality equivalent to single-node DBSCAN
  o 5675x speedup on 8192 nodes (72 million points)
o Two Map/Reduce attempts (2011, 2012)
  o Quality equivalent to single-node DBSCAN
  o 6x speedup on 12 nodes
[2] X. Xu et al., A fast parallel clustering algorithm for large spatial databases (1999)
[3] E. Januzaj et al., DBDC: Density based distributed clustering (2004)
[4] M. Patwary et al., A new scalable parallel DBSCAN algorithm using the disjoint-set data structure (2012)
Challenges of scaling DBSCAN
o Data distribution
  o How do we effectively take an input file and create partitions that can be clustered by DBSCAN?
  o Distributed 2-D partitioner reading from a distributed file system
o Load balancing
  o How do we keep the variance in clustering times across nodes to a minimum?
  o Dense Box
o Merge
  o How do we reduce the amount of data needed for the merge while keeping accuracy high?
  o Representative points
MRNet – Multicast / Reduction Network
o General-purpose TBON API
  o Network: user-defined topology
  o Stream: logical data channel
    o to a set of back-ends
    o multicast, gather, and custom reduction
  o Packet: collection of data
  o Filter: stream data operator
    o synchronization
    o transformation
o Widely adopted by HPC tools
  o CEPBA toolkit
  o Cray ATP & CCDB
  o Open|SpeedShop & CBTF
  o STAT
  o TAU
[Figure: TBON with the front-end (FE) at the root, communication processes (CP) applying a filter F(x1,…,xn) at the internal nodes, and back-ends (BE) attached to application processes (app) at the leaves]
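To make the TBON model concrete, here is a toy Python simulation. It is not MRNet's actual C++ API: back-ends (BE) produce packets, every internal node (each CP, and finally the FE) applies one filter F(x1,…,xn) to the packets from its children, and a multicast sends one message from the FE to every BE. The names `Node`, `reduce_up`, `multicast_down`, and `make_tree` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Node:
    """One process in a toy TBON: "FE" (root), "CP" (internal), or "BE" (leaf)."""
    role: str
    children: List["Node"] = field(default_factory=list)
    packet: Any = None  # data a back-end produces locally


def reduce_up(node: Node, filt: Callable[[List[Any]], Any]) -> Any:
    """Send packets from the back-ends toward the front-end.

    Each internal node applies the same filter F(x1, ..., xn) to the packets
    from its children and forwards a single packet to its parent.
    """
    if not node.children:
        return node.packet
    return filt([reduce_up(child, filt) for child in node.children])


def multicast_down(node: Node, message: Any, deliver: Callable[[Node, Any], None]) -> None:
    """Deliver the same message from the front-end to every back-end."""
    if not node.children:
        deliver(node, message)
    for child in node.children:
        multicast_down(child, message, deliver)


def make_tree(fanout: int, depth: int) -> Node:
    """Balanced tree: FE at level 0, CPs in between, BEs at level `depth`."""
    packets = iter(range(fanout ** depth))  # the i-th BE produces packet i

    def build(level: int) -> Node:
        if level == depth:
            return Node("BE", packet=next(packets))
        role = "FE" if level == 0 else "CP"
        return Node(role, [build(level + 1) for _ in range(fanout)])

    return build(0)
```

With `tree = make_tree(4, 2)` (16 BEs holding packets 0 to 15), `reduce_up(tree, sum)` yields 120 and `reduce_up(tree, max)` yields 15. Because each CP forwards one packet regardless of how many it receives, the filter keeps the data volume per level from growing.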
TBON Computation
Ideal Characteristics:
o Filter output size constant or decreasing
o Computation rate similar across levels
o Adjustable for load balance
[Figure: example TBON (FE, CPs, BEs) with 10 MB of data per BE and packets of ≤10 MB between levels; most levels take ~10 sec, one level is annotated 4x at ~40 sec, and total times of ~30 sec and ~60 sec are shown]
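One simple way to read the timing annotations is a level-synchronous model: a level finishes only when its slowest node does, and each level waits for the level below it, so the end-to-end time is the sum of the per-level maxima. This model and the numbers below are illustrative assumptions, not measurements; they show how a single back-end taking 4x longer than its peers can double the total even when every other level is balanced.

```python
def tbon_time(level_times):
    """End-to-end time of a level-synchronous TBON pass.

    level_times[k] holds the per-node times of level k (back-ends first).
    A level finishes only when its slowest node does, and each level waits
    for the level below it, so the total is the sum of per-level maxima.
    """
    return sum(max(times) for times in level_times)


balanced = [[10, 10, 10, 10], [10, 10], [10]]     # BEs, CPs, FE (illustrative)
one_slow_be = [[40, 10, 10, 10], [10, 10], [10]]  # one BE takes 4x longer
```

Here `tbon_time(balanced)` is 30 and `tbon_time(one_slow_be)` is 60, which is why the computation rate should be similar across levels and adjustable for load balance.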
Intro to Mr. Scan
Mr. Scan Phases
o Partition: Distributed
o DBSCAN: GPU (@ BE)
o Merge: CPU (x #levels)
o Sweep: CPU (x #levels)
[Figure: MRNet tree with the FE at the root, CPs in the middle, and BEs reading from the file system (FS); Merge flows up the tree and Sweep flows back down]
Mr. Scan Architecture
Clustering 6.5 Billion Points
[Figure: timeline of one run, from Time: 0 to Time: 18.2 Min]
o MRNet startup: 130 secs
o Partitioner: FS read 224 secs, FS write 489 secs
o FS read: 24 secs
o DBSCAN: 168 secs
o Merge & Sweep: merge time 6 secs, sweep time 4 secs
o Write output: 19 secs
o Total time: 18.2 min
Partition Phase
o Goal: Partitions computationally equivalent to DBSCAN
o Algorithm:
o Form initial partitions
o Add shadow regions
o Rebalance
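The slide lists the partitioning steps without their details. The sketch below illustrates them on a single machine under stated assumptions: points are bucketed into square cells of side Eps (so every Eps-neighbor of a point lies in the same cell or one of the 8 adjacent cells), consecutive cells in sorted (cx, cy) order are grouped into k partitions of roughly equal point count (a simple stand-in for rebalancing, since Mr. Scan's actual cost model is not given here), and each partition's shadow region holds the points in adjacent cells owned by other partitions. The function names are hypothetical.

```python
from collections import defaultdict
from math import floor


def grid_cells(points, eps):
    """Bucket 2-D points into square cells of side eps, keyed by (cx, cy)."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(floor(x / eps), floor(y / eps))].append((x, y))
    return cells


def partition(points, eps, k):
    """Return k (owned_points, shadow_points) pairs.

    Cells are taken in sorted (cx, cy) order and handed to partitions so
    that each receives roughly len(points) / k points. A partition's shadow
    region holds points from neighboring cells that other partitions own;
    any point within eps of an owned point lies in its own cell or one of
    the 8 cells around it.
    """
    cells = grid_cells(points, eps)
    order = sorted(cells)
    target = len(points) / k
    owner, part, filled = {}, 0, 0
    for c in order:
        if filled >= target * (part + 1) and part < k - 1:
            part += 1  # this partition has its share; move to the next one
        owner[c] = part
        filled += len(cells[c])

    result = []
    for i in range(k):
        mine = [c for c in order if owner[c] == i]
        owned = [p for c in mine for p in cells[c]]
        around = {(cx + dx, cy + dy) for cx, cy in mine
                  for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
        shadow = [p for c in around if c in cells and owner[c] != i for p in cells[c]]
        result.append((owned, shadow))
    return result
```

Each partition can then run DBSCAN on its owned points plus its shadow points, so clusters that cross a partition boundary are visible from both sides and can be merged later.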
Distributed Partitioner
GPU DBSCAN Filter
DBSCAN is performed in two distinct steps:
Step 1: Detect core points
Step 2: Expand core points and color
[Figure: each step runs as a grid of 900 thread blocks (Block 1 … Block 900) with 512 threads each (T1 … T512)]
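The two GPU steps can be imitated on the CPU to show their data-parallel structure. In this sketch, each point's core test in step 1 is independent (one GPU thread per point in a real kernel), and step 2 colors clusters by repeatedly letting every core point adopt the smallest color among its core neighbors until no color changes, then letting border points take a neighboring core point's color. This min-label propagation is a stand-in chosen for clarity; the slide does not say how Mr. Scan's kernels expand and color, and the function names are hypothetical.

```python
from math import dist


def detect_core(points, eps, min_pts):
    """Step 1: each point's test is independent (one GPU thread per point)."""
    return [sum(dist(p, q) <= eps for q in points) >= min_pts for p in points]


def expand_and_color(points, eps, core):
    """Step 2: give every eps-connected group of core points one color.

    Each pass lets every core point adopt the smallest color among its core
    neighbors (all updates read the previous pass, as parallel threads would);
    passes repeat until nothing changes. Border points then take the color of
    a neighboring core point, and everything else stays -1 (noise).
    """
    n = len(points)
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps] for i in range(n)]
    color = [i if core[i] else -1 for i in range(n)]
    changed = True
    while changed:
        changed = False
        new = color[:]
        for i in range(n):
            if core[i]:
                best = min(color[j] for j in nbrs[i] if core[j])
                if best < new[i]:
                    new[i] = best
                    changed = True
        color = new
    for i in range(n):
        if not core[i]:
            owners = [color[j] for j in nbrs[i] if core[j]]
            if owners:
                color[i] = min(owners)
    return color
```

A cluster's color is the index of its lowest-numbered core point rather than a consecutive ID; a later pass (or the sweep) can renumber them.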
Dense Box
• One significant scalability
issue is dealing with dense
regions of data
• Density increases the
computation cost of DBSCAN
• We reduce the computation cost of high-density regions by preclustering these regions
[Figure: regions R1 and R2; R2 requires more comparison operations]
KD-Tree: look at each leaf bounding box for boxes with point count > MinPts and size < 0.35 * Eps. DBSCAN no longer needs to expand these regions.
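Why a 0.35 * Eps box is safe can be checked directly: in 2-D, a box whose sides are both shorter than 0.35 * Eps has a diagonal shorter than 0.35 * √2 * Eps ≈ 0.495 * Eps, so every pair of points inside it is within Eps. If the box also holds more than MinPts points, each of them has more than MinPts neighbors, so all are core points of one cluster and can be preclustered without further distance checks. The sketch below builds a simple median-split KD-tree and flags such leaves; the leaf size and function names are illustrative assumptions.

```python
from math import dist


def kd_leaves(points, leaf_size, depth=0):
    """Median-split KD-tree on alternating axes; return the leaf point lists."""
    if leaf_size < 1:
        raise ValueError("leaf_size must be at least 1")
    if len(points) <= leaf_size:
        return [points]
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (kd_leaves(pts[:mid], leaf_size, depth + 1)
            + kd_leaves(pts[mid:], leaf_size, depth + 1))


def dense_boxes(points, eps, min_pts, leaf_size=8):
    """Leaves with more than min_pts points and both sides < 0.35 * eps."""
    dense = []
    for leaf in kd_leaves(points, leaf_size):
        if len(leaf) <= min_pts:
            continue
        xs = [p[0] for p in leaf]
        ys = [p[1] for p in leaf]
        if max(xs) - min(xs) < 0.35 * eps and max(ys) - min(ys) < 0.35 * eps:
            dense.append(leaf)  # every point here is core; precluster as one group
    return dense


def all_within_eps(box, eps):
    """Check the guarantee: every pair of points in the box is within eps."""
    return all(dist(p, q) <= eps for p in box for q in box)
```

Only the leaves that pass this test are skipped by DBSCAN; leaves that mix dense and sparse points are clustered normally.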
Merge Algorithm
o Merge overlapping clusters found on different nodes.
o Two steps in the merge operation:
  1. Select representative points (BE)
  2. Merge operation
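Once overlaps have been detected, the partial clusters from different nodes still have to be combined into global clusters. One common way to record this is a disjoint-set (union-find) structure, the same data structure PDSDBSCAN [4] is built on; the slides do not say whether Mr. Scan's merge uses it. In this sketch, a partial cluster is keyed by a hypothetical (node_rank, local_cluster_id) pair.

```python
class DisjointSet:
    """Union-find over hashable cluster keys such as (node_rank, local_cluster_id)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller key as the root so results are deterministic.
            self.parent[max(ra, rb)] = min(ra, rb)


def merge_clusters(clusters, overlaps):
    """Map every partial cluster to the global cluster it belongs to.

    clusters: iterable of (node_rank, local_cluster_id) keys.
    overlaps: pairs of keys that the overlap test says belong together.
    """
    ds = DisjointSet()
    for c in clusters:
        ds.find(c)
    for a, b in overlaps:
        ds.union(a, b)
    return {c: ds.find(c) for c in ds.parent}
```

Because union is associative, each CP can merge what its children send and forward the result, so the same logic can run at every level of the tree.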
Representative Points
o These are points that represent the core points in the dataset.
o They create a boundary within which at least one core point shared between overlapping clusters must be contained.
Representative points are the points closest to the corners and to the midpoints of the sides of the Eps box.
These points create a boundary (shaded region) within which a point must fall for overlapping clusters to be merged.
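The selection rule above can be sketched as picking, for each of 8 targets (the 4 corners and 4 side midpoints of the Eps box), the cluster's core point nearest to it. The box is assumed to be an axis-aligned square of side Eps with lower-left corner (x0, y0); the function names are hypothetical, and ties go to the first point in the list.

```python
from math import dist


def box_targets(x0, y0, eps):
    """The 4 corners and 4 side midpoints of the Eps box with lower-left (x0, y0)."""
    x1, y1 = x0 + eps, y0 + eps
    xm, ym = x0 + eps / 2, y0 + eps / 2
    return [(x0, y0), (x1, y0), (x0, y1), (x1, y1),   # corners
            (xm, y0), (xm, y1), (x0, ym), (x1, ym)]   # side midpoints


def representative_points(core_points, x0, y0, eps):
    """For each target, the cluster's core point closest to it (ties: first wins)."""
    return [min(core_points, key=lambda p: dist(p, t)) for t in box_targets(x0, y0, eps)]
```

The BE then sends only these (at most 8 distinct) points per cluster up the tree instead of every core point, which keeps merge packets small.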
Merge Algorithm
• The merge algorithm is responsible for merging overlapping clusters detected on different DBSCAN nodes.
• The merge must have low overhead and must work without the full dataset.
1. Core/Core overlap: the two nodes have a core point in common. Detection takes 64 operations.
2. Non-core/Core overlap: a core point is seen as non-core by one node. Detection requires MinPts * 2 operations.
[Figure: overlapping clusters on Node 1 and Node 2, with core and non-core points marked]
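One reading of the 64-operation figure is that each cluster contributes 8 representative points (4 corners and 4 side midpoints), so checking two clusters for a shared core point compares 8 x 8 = 64 pairs in the worst case. The sketch below implements only that core/core check, matching points by exact coordinates on the assumption that a point duplicated in a shadow region has identical coordinates on both nodes. The non-core/core case is not modeled, and the function name is hypothetical.

```python
def core_core_overlap(reps_a, reps_b):
    """Compare every representative of cluster A with every one of cluster B.

    Returns (found, comparisons). With 8 representatives per cluster the
    worst case is 8 * 8 = 64 comparisons. Points are matched by exact
    coordinates (assumed identical when a point is duplicated on two nodes).
    """
    comparisons = 0
    for p in reps_a:
        for q in reps_b:
            comparisons += 1
            if p == q:
                return True, comparisons
    return False, comparisons
```

The cost depends only on the number of representatives, not on cluster size, which is what lets the merge run without the full dataset.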
Sweep Step
o Get cluster identifiers and file offsets down to the BEs so they can write the final clusters.
o The FE gives each cluster a unique ID and a file offset.
o This data is passed back down to the BE that holds the data in the cluster.
o The data is written out to disk by the BE.
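The ID and offset assignment at the FE can be sketched as an exclusive prefix sum, under assumptions the slide does not state: output records have a fixed size (`record_size`), each merged cluster is written contiguously, and the pieces of a cluster held by different BEs are laid out one after another. The function names and the (be_rank, point_count) input format are hypothetical.

```python
from itertools import accumulate


def exclusive_prefix_sum(values):
    """[a, b, c] -> [0, a, a + b]."""
    return [0] + list(accumulate(values))[:-1] if values else []


def sweep(cluster_parts, record_size):
    """Plan where each back-end writes its share of every merged cluster.

    cluster_parts[cid] is a list of (be_rank, point_count) pieces of cluster cid.
    Returns {be_rank: [(cluster_id, byte_offset, point_count), ...]}.
    """
    sizes = [sum(n for _, n in parts) for parts in cluster_parts]
    starts = exclusive_prefix_sum([s * record_size for s in sizes])
    plan = {}
    for cid, (parts, start) in enumerate(zip(cluster_parts, starts)):
        offsets = exclusive_prefix_sum([n * record_size for _, n in parts])
        for (be, n), off in zip(parts, offsets):
            plan.setdefault(be, []).append((cid, start + off, n))
    return plan
```

For `sweep([[(0, 3), (1, 2)], [(1, 4)]], 16)`, BE 0 writes 3 records of cluster 0 at byte 0, while BE 1 writes 2 records of cluster 0 at byte 48 and 4 records of cluster 1 at byte 80. Because every BE knows its offsets in advance, all BEs can write to the output in parallel without coordinating with each other.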
Experiment Setup
o Dataset: generated data with a distribution derived from real Twitter data
o Measuring:
  o Weak scaling up to 8192 GPUs
  o Strong scaling
  o Quality compared to single-threaded DBSCAN
Results
Weak scaling: a 4096x increase in data and compute leads to an 18.48x to 31.68x increase in time
Results Breakdown – Partition Phase
At 6.5 billion points: 65.9% of Mr. Scan’s time
94.6% of the partition phase is I/O time
Results Breakdown – GPU Cluster Time
Strong Scaling
Quality
Future Work
o Remove partitioner’s I/O bottleneck
o Multiple dimensions
Conclusion
o Clustered 6.5 billion points with DBSCAN in 18.2 minutes
o Controlled the computational variance of DBSCAN
o Partitioner I/O is the main obstacle to scaling
Questions?
A Brief Discussion of Ways and Means
Summary of previous Mr. Scan implementation
Algorithm Steps
o SpatialDecomp: CPU (@ FE)
o DBSCAN: CPU or GPU (@ BE)
o DrawBoundBox: CPU or GPU
o MergeCluster: CPU (x #levels)
[Figure: MRNet tree with the FE at the root, CPs in the middle, and BEs at the leaves; DBSCAN runs at the BEs and MergeCluster moves up the tree]