Mr. Scan: Efficient Clustering with MRNet and GPUs
Evan Samanas and Ben Welton
Paradyn Project
Paradyn / Dyninst Week
Madison, Wisconsin
April 29-May 3, 2013

Density-based clustering
o Discovers the number of clusters
o Finds oddly-shaped clusters

Clustering Example (DBSCAN [1])
o Goal: Find regions that meet minimum density and spatial distance characteristics
o The two parameters that determine whether a point is in a cluster are Epsilon (Eps) and MinPts
o If the number of points within Eps of a point is > MinPts, the point is a core point
o For every discovered point, the same calculation is performed until the cluster is fully expanded
[Figure: example clustering with MinPts = 3, showing the Eps radius around points]
[1] M. Ester et al., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (1996)

Scaling DBSCAN
o PDBSCAN (1999) [2]
  o Quality equivalent to single-node DBSCAN
  o Linear speedup up to 8 nodes
o DBDC (2004) [3]
  o Sacrifices quality
  o ~30x speedup on 15 nodes
o PDSDBSCAN (2012) [4]
  o Quality equivalent to single-node DBSCAN
  o 5675x speedup on 8192 nodes (72 million points)
o Two Map/Reduce attempts (2011, 2012)
  o Quality equivalent to single-node DBSCAN
  o 6x speedup on 12 nodes
[2] X. Xu et al., A Fast Parallel Clustering Algorithm for Large Spatial Databases (1999)
[3] E. Januzaj et al., DBDC: Density Based Distributed Clustering (2004)
[4] M. Patwary et al., A New Scalable Parallel DBSCAN Algorithm Using the Disjoint-Set Data Structure (2012)

Challenges of scaling DBSCAN
o Data distribution
  o How do we effectively take an input file and create partitions that can be clustered by DBSCAN?
  o Distributed 2-D partitioner reading from a distributed file system
o Load balancing
  o How do we keep the variance in clustering times across nodes to a minimum?
  o Dense Box
o Merge
  o How do we reduce the amount of data needed for the merge while keeping accuracy high?
  o Representative points

MRNet – Multicast / Reduction Network
o General-purpose TBON API
o Network: user-defined topology
o Stream: logical data channel
  o to a set of back-ends
  o multicast, gather, and custom reduction
o Packet: collection of data
o Filter: stream data operator
  o synchronization
  o transformation
o Widely adopted by HPC tools
  o CEPBA toolkit
  o Cray ATP & CCDB
  o Open|SpeedShop & CBTF
  o STAT
  o TAU
[Figure: tree with a front-end (FE), communication processes (CP) applying a filter F(x1,…,xn), and back-ends (BE) attached to application processes (app)]

TBON Computation
Ideal characteristics:
o Filter output size constant or decreasing
o Computation rate similar across levels
o Adjustable for load balance
[Figure: example tree with 10 MB of data per BE, packets of ≤10 MB between levels, and about 10 sec of computation per level; labels also show ~30 sec, ~40 sec, 4x, and Total Time: ~60]

Intro to Mr. Scan
Mr. Scan phases:
o Partition: distributed
o DBSCAN: GPU (@ BE)
o Merge: CPU (x #levels)
o Sweep: CPU (x #levels)
[Figure: the file system (FS) feeds the BEs, which run DBSCAN; Merge and Sweep run through the CPs and the FE]

Mr. Scan Architecture (clustering 6.5 billion points)
o Partitioner: FS read 224 secs, FS write 489 secs
o MRNet startup: 130 secs
o DBSCAN: FS read 24 secs, DBSCAN 168 secs
o Merge & Sweep: merge time 6 secs, sweep time 4 secs
o Write output: 19 secs
o Total time: 18.2 min

Partition Phase
o Goal: partitions that are computationally equivalent for DBSCAN
o Algorithm:
  o Form initial partitions
  o Add shadow regions
  o Rebalance

Distributed Partitioner
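The first two steps of the partition phase (form initial partitions, add shadow regions) can be sketched in Python as follows. This is a minimal single-process sketch, not the distributed partitioner itself, and it rests on assumptions not in the slides: partitions are axis-aligned rectangles cut at point-count quantiles (a stand-in for "computationally equivalent" partitions), and the shadow region is the rectangle grown by Eps on each side. The names `quantile_bounds`, `grid_partitions`, and `owned_and_shadow` are hypothetical, and the rebalance step is omitted.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xlo, xhi, ylo, yhi), half-open


def quantile_bounds(values: List[float], k: int) -> List[float]:
    """Cut points so each [b[i], b[i+1]) holds ~len(values)/k values.

    Returns k+1 bounds, or a single unbounded interval if values is empty.
    """
    vs = sorted(values)
    inner = [vs[i * len(vs) // k] for i in range(1, k)] if vs else []
    return [-math.inf] + inner + [math.inf]


def grid_partitions(points: List[Point], kx: int, ky: int) -> List[Rect]:
    """Initial partitions: kx strips in x, each cut into ky pieces in y."""
    rects: List[Rect] = []
    xb = quantile_bounds([p[0] for p in points], kx)
    for xlo, xhi in zip(xb, xb[1:]):
        strip = [p for p in points if xlo <= p[0] < xhi]
        yb = quantile_bounds([p[1] for p in strip], ky)
        for ylo, yhi in zip(yb, yb[1:]):
            rects.append((xlo, xhi, ylo, yhi))
    return rects


def owned_and_shadow(points: List[Point], rect: Rect, eps: float):
    """Split points into those the partition owns and its shadow region.

    The shadow is every point inside the rectangle grown by eps on each side
    but outside the rectangle itself. This is a superset of the points that
    can lie within eps of an owned point, so local DBSCAN sees every neighbor.
    """
    xlo, xhi, ylo, yhi = rect
    owned, shadow = [], []
    for x, y in points:
        if xlo <= x < xhi and ylo <= y < yhi:
            owned.append((x, y))
        elif xlo - eps <= x < xhi + eps and ylo - eps <= y < yhi + eps:
            shadow.append((x, y))
    return owned, shadow
```

For example, with points at x = 0, 1, 2, 3 on a line, `grid_partitions(pts, 2, 1)` cuts at x = 2, and with Eps = 1 the left partition owns the first two points while the point at x = 2 becomes its shadow. Balancing by point count ignores density, which is one reason the slides list a separate rebalance step.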
GPU DBSCAN Filter
DBSCAN is performed in two distinct steps:
o Step 1: Detect core points
o Step 2: Expand core points and color
[Figure: each step runs as GPU blocks 1 through 900, each with threads T1 through T512]

Dense Box
• One significant scalability issue is dealing with dense regions of data
• Density increases the computation cost of DBSCAN: in the figure, the denser region R2 requires more comparison operations than R1
• We reduce the computation cost of high-density regions by preclustering these regions
  • Look at each leaf bounding box of the KD-tree for boxes with point count > MinPts and size < 0.35 * Eps (if size is the box's side length, its diagonal is under 0.5 * Eps, so every point in the box lies within Eps of all the others and is a core point)
  • DBSCAN no longer needs to expand these regions

Merge Algorithm
o Merge overlapping clusters found on different nodes
o Two steps in the merge operation:
  1. Select representative points (BE)
  2. Merge operation

Representative Points
o These are points that represent the core points in the dataset
o They create a boundary in which at least one core point shared between overlapping clusters must be contained
o Representative points are the points closest to the corners and to the midpoints of the sides of the Eps box
o These points create a boundary (shaded region) in which a point must fall for overlapping clusters to be merged

Merge Algorithm
• The merge algorithm is responsible for merging overlapping clusters detected on different DBSCAN nodes
• It must handle the merge with low overhead and without the full dataset
1. Core/Core overlap: the two nodes have a core point in common; 64 operations to detect
2. Non-core/Core overlap: a core point is seen as non-core by one node; MinPts * 2 operations required to detect
[Figure: clusters on Node 1 and Node 2, with core and non-core points marked]
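The two-step DBSCAN procedure from the GPU DBSCAN Filter slide (detect core points, then expand core points and color) can be sketched sequentially in Python. This is an illustrative sketch, not the Mr. Scan GPU kernel: a brute-force O(n²) neighbor search stands in for the KD-tree and the GPU blocks, the dense-box shortcut is left out, and `region_query`, `dbscan`, and `NOISE` are hypothetical names. The core test follows Ester et al. [1], counting the point itself and requiring at least MinPts neighbors; the slide phrases the test as "> MinPts", and conventions differ on this detail.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
NOISE = -1  # label for points that belong to no cluster


def region_query(points: List[Point], i: int, eps: float) -> List[int]:
    """Indices of every point within eps of points[i], including i itself."""
    px, py = points[i]
    return [j for j, (x, y) in enumerate(points)
            if math.hypot(x - px, y - py) <= eps]


def dbscan(points: List[Point], eps: float, min_pts: int) -> List[int]:
    """Return a cluster label per point (NOISE for unclustered points)."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]

    # Step 1: detect core points (Eps-neighborhood holds at least min_pts points).
    is_core = [len(nb) >= min_pts for nb in neighbors]

    # Step 2: expand from each unlabeled core point, coloring all it reaches.
    labels = [NOISE] * len(points)
    next_label = 0
    for seed in range(len(points)):
        if not is_core[seed] or labels[seed] != NOISE:
            continue
        labels[seed] = next_label
        frontier = [seed]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = next_label
                    if is_core[q]:  # only core points extend the cluster
                        frontier.append(q)
        next_label += 1
    return labels
```

With MinPts = 3 (as in the clustering example) and Eps = 1.5, two unit squares of points far apart become clusters 0 and 1, a point 1.4 away from one square joins it as a non-core border point, and an isolated point stays NOISE. Splitting core detection from expansion mirrors the slide: step 1 is independent per point, which is what makes it easy to spread across GPU threads.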
Sweep Step
o Get cluster identifiers and file offsets down to the BEs so they can write the final clusters
o The FE gives each cluster a unique ID and a file offset
o This data is passed back down to the BE that holds the cluster's data
o The BE writes the data out to disk

Experiment Setup
o Dataset: generated data with a distribution taken from real Twitter data
o Measuring:
  o Weak scaling up to 8192 GPUs
  o Strong scaling
  o Quality compared to single-threaded DBSCAN

Results
Weak scaling: a 4096x increase in data and compute yields an 18.48x-31.68x increase in time

Results Breakdown – Partition Phase
At 6.5 billion points:
o 65.9% of Mr. Scan's time
o 94.6% I/O time

Results Breakdown – GPU Cluster Time

Strong Scaling

Quality

Future Work
o Remove the partitioner's I/O bottleneck
o Multiple dimensions

Conclusion
o Clustered 6.5 billion points with DBSCAN in 18.2 minutes
o Controlled the computational variance of DBSCAN
o Partitioner I/O = scaling enemy

Questions?
A Brief Discussion of Ways and Means

Summary of previous Mr. Scan implementation
Algorithm steps:
o SpatialDecomp: CPU (@ FE)
o DBSCAN: CPU or GPU (@ BE)
o DrawBoundBox: CPU or GPU
o MergeCluster: CPU (x #levels)
[Figure: FE, CP, and BE tree, with DBSCAN at the BEs and MergeCluster at the levels above]
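The front-end's part of the Sweep Step (a unique ID and a file offset per cluster, sent back down to the BEs that hold the data) can be sketched in Python as an exclusive prefix sum. This sketch assumes details the slides do not give: each merged cluster is described as a list of pieces `(backend_id, local_label, point_count)`, the output file uses fixed-size records of `record_size` bytes, and each cluster is written as one contiguous region. The function name `sweep` and its input layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical input: each merged cluster is a list of pieces, one per
# back-end holding some of its points, as (backend_id, local_label, count).
Piece = Tuple[int, int, int]


def sweep(clusters: List[List[Piece]], record_size: int
          ) -> Dict[int, List[Tuple[int, int, int]]]:
    """Assign each merged cluster a unique ID and each piece a byte offset.

    Offsets are an exclusive prefix sum over piece sizes, so every cluster
    occupies one contiguous region of the output file. The result maps each
    back-end to the (local_label, cluster_id, byte_offset) triples it would
    receive on the way back down the tree.
    """
    per_backend: Dict[int, List[Tuple[int, int, int]]] = {}
    offset = 0
    for cluster_id, pieces in enumerate(clusters):
        for backend, local_label, count in pieces:
            per_backend.setdefault(backend, []).append(
                (local_label, cluster_id, offset))
            offset += count * record_size  # next piece starts after this one
    return per_backend
```

For example, if cluster 0 has 3 points on BE 0 and 2 on BE 1, and cluster 1 has 4 points on BE 1, then with 10-byte records BE 0 writes at offset 0 and BE 1 writes at offsets 30 and 50. Because offsets are fixed before any data moves, each BE can write its pieces independently without further coordination.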