Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data
Alexander Kalinin, Ugur Cetintemel, Stan Zdonik

Interactive Data Exploration (IDE)
• Searching for the "interesting" within big data
  (Figure: "Where's Waldo?" next to "Where's Horrible Gelatinous Blob?")
• Exploratory analysis: ad-hoc & repetitive
  – Questions are not well defined
  – The "interesting" can be complex: hard to find, hard to compute
  – Fast, online results are needed (human-in-the-loop)

Exploratory Queries: Some examples
• First-order: "Celestial 3-5° by 5-7° regions with brightness > 0.8"
• Higher-order: "Pairs of 2° by 2° celestial regions with similarity > 0.5"
• Optimized: "Celestial 3° by 7° region with maximum brightness"
• Data source: Sloan Digital Sky Survey (SDSS)

Sub-sequence Matching
(Figure: sub-sequence matching example.)

Two Sides of Data Exploration
• Search complexity (the CP side)
  – The search space is large: enumeration isn't feasible
  – Constraints are elaborate: more than just ranges
• Data complexity (the DBMS side)
  – Large data sets ("big data"): hard to fit in memory
  – Expensive computations: functions over a lot of objects

DBMSs for IDE?
• No native support for exploratory constructs
  – No power set
  – Limited support for user-defined logic
• Poor support for interactivity

"Celestial 3-5° by 5-7° regions with average brightness > 0.8" in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)

Data Exploration as a CP problem
"Celestial 3-5° by 5-7° regions with average brightness > 0.8"
• Decision variables
  – Leftmost corner: r ∈ [100, 200], d ∈ [5, 40]
  – Lengths: rl ∈ [3, 5], dl ∈ [5, 7]
• Constraint: avg_br(r, d, rl, dl) > 0.8
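A minimal Python sketch of this formulation, not Searchlight code: the four decision variables with the domains above and the avg_br constraint evaluated over a brightness grid. The synthetic brightness array, its one-cell-per-degree resolution, the summed-area table, and the brute-force enumeration are illustrative assumptions; a CP solver explores the same space but prunes most of it using bounds instead of enumerating every assignment.

```python
import numpy as np

# Hypothetical brightness grid: one cell per degree, ra in [100, 200], dec in [5, 40].
rng = np.random.default_rng(0)
brightness = rng.random((101, 36))   # indexed by (ra - 100, dec - 5)

# Summed-area table (padded with a zero row/column) for O(1) rectangle sums.
sat = np.pad(brightness.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def avg_br(r, d, rl, dl):
    """Average brightness of the rl-by-dl region whose leftmost corner is (r, d)."""
    i, j = r - 100, d - 5
    total = sat[i + rl, j + dl] - sat[i, j + dl] - sat[i + rl, j] + sat[i, j]
    return total / (rl * dl)

# Brute-force enumeration of the search space defined by the decision variables.
# With uniform random data this typically finds no qualifying regions; the point
# is the formulation, not the result.
solutions = [
    (r, d, rl, dl)
    for rl in range(3, 6)              # rl in [3, 5]
    for dl in range(5, 8)              # dl in [5, 7]
    for r in range(100, 202 - rl)      # r in [100, 200], region must fit: r + rl - 1 <= 200
    for d in range(5, 42 - dl)         # d in [5, 40], region must fit: d + dl - 1 <= 40
    if avg_br(r, d, rl, dl) > 0.8      # the query constraint
]
print(len(solutions), "regions satisfy avg_br > 0.8")
```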
CP Solvers
• Large variety of methods for exploring a search space
  – Branch-and-Cut
  – Large Neighborhood Search (LNS)
  – Randomized search with restarts
• Highly extensible – important for ad-hoc exploration!
  – New constraints/functions
  – New search heuristics
• But… comparing with DBMSs:
  – In-memory data (CP) vs. efficient disk data handling (DBMS)
  – No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)

CP + DBMS for Data Intensive Exploration

Exploring Alternatives
• Setup: 8-node cluster, 120 GB of data, 2 GB of memory
• Large search space, time-limited execution (1 hour)
  – Searchlight: first results in 5 s, subsequent delays of 6 s
  – CP: no results (N/A)
  – SciDB: no results (N/A)
• Small search space
  – Searchlight: first result in 4.8 s, total time 5.13 s
  – CP: first result in 91 s, total time 304 s
  – SciDB: first result in 301.3 s, total time 945.3 s

Dynamic Solve-Validate Approach
(Figure: inside a SciDB instance, a Router connects the CP Solver to the synopsis array and the Validator to the data array; candidate solutions flow from the Solver to the Validator.)
• CP Solver: runs the CP search process
  – Synopsis-only access
  – Produces candidate solutions
• Validator: validates candidates
  – Data access
  – Produces real solutions
  – CP-based
• Router: provides uniform access to data
  – CP Solver → synopsis, Validator → data
  – Supports transparency

Synopsis Pyramid
(Figure: a 4x4 data array and its 2x2 and 1x1 synopsis levels; each synopsis cell stores min, max, sum, and count over the elements it covers.)
• Example: for the 13 non-empty elements in the figure, the 1x1 synopsis is (min 1, max 5, sum 36, count 13); the four 2x2 synopsis cells are (min 1, max 4, sum 10, count 4), (min 2, max 5, sum 13, count 4), (min 2, max 4, sum 9, count 3), and (min 1, max 3, sum 4, count 2)
• Allows approximate answers for data API calls
  – Lower and upper bounds, with 100% confidence
• Synopsis resolution trade-off
  – Coarse synopses are less accurate, but fast
  – Fine synopses are very accurate, but slow
  – Dynamic synopsis choice based on the query region
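A minimal sketch of this synopsis idea, assuming NumPy and a 2-D array; the cell sizes, the truncation of partial edge blocks, the pyramid as a plain dict, and the min/max bounding rules are illustrative assumptions, not the SciDB/Searchlight implementation. Each cell stores (min, max, sum, count), and API calls such as elem() and avg() are answered with guaranteed [lower, upper] bounds.

```python
import numpy as np

def build_synopsis(data, cell):
    """One synopsis level: (min, max, sum, count) per cell x cell block of a 2-D array.
    For simplicity, edge blocks that are not completely filled are ignored here."""
    h = (data.shape[0] // cell) * cell
    w = (data.shape[1] // cell) * cell
    blocks = data[:h, :w].reshape(h // cell, cell, w // cell, cell).swapaxes(1, 2)
    return {
        "min": blocks.min(axis=(2, 3)),
        "max": blocks.max(axis=(2, 3)),
        "sum": blocks.sum(axis=(2, 3)),
        "count": np.full((h // cell, w // cell), cell * cell),
    }

def elem_bounds(syn, cell, x, y):
    """elem(x, y): the value is only known to lie within its cell's [min, max]."""
    i, j = x // cell, y // cell
    return syn["min"][i, j], syn["max"][i, j]

def avg_bounds(syn, cell, x1, x2, y1, y2):
    """avg over [x1, x2) x [y1, y2): a loose but safe bound from the touched cells.
    (A tighter upper bound, as in the 'Upper Bound Example' backup slide, is sketched later.)"""
    i1, i2 = x1 // cell, (x2 - 1) // cell + 1
    j1, j2 = y1 // cell, (y2 - 1) // cell + 1
    return syn["min"][i1:i2, j1:j2].min(), syn["max"][i1:i2, j1:j2].max()

# A synopsis "pyramid": the same array summarized at several resolutions.
data = np.random.default_rng(0).integers(1, 6, size=(8, 8))
pyramid = {c: build_synopsis(data, c) for c in (2, 4, 8)}       # fine -> coarse
print(elem_bounds(pyramid[2], 2, 0, 0))       # an interval like (1, 4): lossy, but certain
print(avg_bounds(pyramid[4], 4, 1, 6, 2, 7))  # [lower, upper] for an average query
```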
Distributed Searchlight in a Nutshell
• Layer of CP Solvers
  – Search balancing
  – Multiple solvers per instance, depending on free CPU cores
• Layer of Validators
  – Data partitioning
  – Multiple validators per instance, depending on free CPU cores
• Disjoint layers
  – Different numbers of processes
  – No mandatory collocation
  – Dynamic allocation

Static Search Partitioning
(Figure: a two-dimensional search space sliced along one variable, with slices assigned to solvers 1-4 in round-robin fashion.)
• Two-dimensional search space
  – Variables: x, y
  – Interval domains
• Search partitions
  – Divide the intervals
  – Each solver gets a slice
• Features
  – Works with any heuristic
  – Covers hot-spots

Dynamic Search Balancing
Example: x = [0, 100] is split into [0, 50] and [50, 100].
• Busy solver
  1. Go to [0, 50]
  2. Help available! Push [0, 50] to the helper
  3. Go to [50, 100]
• Idle solver
  1. Idle!
  2. Got [0, 50]
  3. Explore it as its own search partition

Data Partitioning
(Figure: the data array is partitioned across Validators 1-3.)
• No data prefetching
• Fetch only when needed (i.e., on validations)
• Data transfer is transparent to validators

Other Optimizations
• Using synopses for validations
  – The query region must be aligned with the synopsis grid
• Dividing data partitions into zones
  – Avoids thrashing
  – Validate candidates from recent zones
• Solver/Validator balancing
  – Dynamically redistribute CPU between solvers and validators
  – Many candidates: more validators, and vice versa
  – Utilize idle times for validations

SDSS Results
• Google's or-tools + SciDB
• 80 GB of SDSS data
• Varying selectivity: grid size, region size, magnitudes
• Per query: first result, min/avg/max delays, total
  – Q1: 10, 0.001/2/54, 300
  – Q2: 17, 17, 132
  – Q3: 24, 0.004/6/45, 331
  – Q4: 29, 0.21/13/29, 134

Related Work
• PackageBuilder (UMass Amherst & NYU Abu Dhabi)
  – Sets of tuples with global constraints
  – Pruning, local search, MIP
• Constraint Programming
  – Solvers, parallel search, heuristics, …
• DBMSs & Spatial DBMSs
  – "Simple" retrieval queries
• Content-Based Retrieval (CBR)

Ongoing Work
• Query planning for search queries
  – Higher-order queries
  – DBMS integration (e.g., push-down predicates)
• Exploring new datasets/constraints
  – MIMIC dataset
  – Sub-sequence matching

Thank you! Questions?

Search process for a backtracking CP solver
(Figure: search tree. The solver starts from ra = [100, 200], dec = [5, 40], splits ra into [100, 132], [133, 165], [166, 200], splits dec of a chosen branch into [5, 16], [17, 28], [29, 40], and keeps splitting, e.g., ra = [133, 165] into ra = 133 and ra = [134, 165], until it reaches single assignments such as ra = 133, dec = 29; branches that cannot satisfy the constraints fail and the solver backtracks.)

"Celestial 3-5° by 5-7° regions with average brightness > 2"
• Decision variables: r ∈ [100, 200], d ∈ [5, 40], rl ∈ [3, 5], dl ∈ [5, 7]
• Constraints
  – avg(r, d, rl, dl) > 2
  – r + rl – 1 <= 200
  – d + dl – 1 <= 40
• CP "UDFs"
  – z = avg(…), with z ∈ (2, +inf)
  – Access the data
  – Provide min/max values
• UDF → Searchlight API calls: aggregate(X1, X2), elem(X)

Synopsis is lossy compression
(Figure: the example data array with its 2x2 synopsis cells, plus two reconstructions of the array consistent with the synopsis, one giving upper bounds and one giving lower bounds.)
• Top-right cell: actual values (5, 3, 3, 2); the synopsis stores Min = 2, Max = 5, Sum = 13, Count = 4
• The cell's distribution is not recoverable: it could be (5, 2, 4, 2), (2, 5, 2, 4), (5, 3, 3, 2), …
• The synopsis answers API calls with intervals
  – elem(0, 0) → [1, 4]
  – avg over the highlighted region → [m, M], where m is a lower bound and M an upper bound

Upper Bound Example
(Figure: a query region that fully covers one synopsis cell, with min 1, max 4, sum 10, count 4, and partially covers three others: (min 2, max 5, sum 13, count 4), (min 2, max 4, sum 9, count 3), (min 1, max 3, sum 4, count 2).)
• Fully covered cells contribute exactly: a = 10/4 = 2.5
• For each partially covered cell, derive the largest values consistent with its stats, as (value, count) pairs:
  – (5, 1), (4, 1), (2, 2)
  – (4, 1), (3, 1), (2, 1)
  – (3, 1), (1, 1)
• Add these values to the running average in descending order:
  1. 10/4 + 5 → 15/5 = 3
  2. 15/5 + 4 → 19/6 ≈ 3.17
  3. 19/6 + 4 → 23/7 ≈ 3.29
  4. The next value, 3, is below 3.29 and would lower the average: stop. The upper bound on the average is ≈ 3.29
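A minimal Python sketch of this upper-bound computation, reconstructed from the worked example above; the function and variable names are mine, and the real Searchlight code surely differs. Fully covered cells contribute their exact sum and count, each partially covered cell contributes the largest value multiset consistent with its (min, max, sum, count), and values are added greedily in descending order as long as they raise the running average.

```python
def descending_values(cmin, cmax, csum, ccount):
    """Largest possible values in a cell, in descending order, consistent with
    its (min, max, sum, count) synopsis entry."""
    values, rem_sum, rem_count = [], csum, ccount
    for _ in range(ccount):
        # Take the biggest value that still leaves at least cmin for the rest.
        v = min(cmax, rem_sum - cmin * (rem_count - 1))
        values.append(v)
        rem_sum -= v
        rem_count -= 1
    return values

def avg_upper_bound(full_sum, full_count, partial_cells):
    """Upper bound on the region average: exact stats for fully covered cells,
    then greedily add the largest candidate values from partially covered cells
    while they raise the running average."""
    candidates = sorted(
        (v for cell in partial_cells for v in descending_values(*cell)),
        reverse=True,
    )
    s, n = full_sum, full_count
    for v in candidates:
        if n > 0 and v <= s / n:   # adding v would not raise the average: stop
            break
        s += v
        n += 1
    return s / n

# The numbers from the slide: one fully covered cell (sum 10, count 4) and three
# partially covered cells given as (min, max, sum, count).
partial = [(2, 5, 13, 4), (2, 4, 9, 3), (1, 3, 4, 2)]
print(round(avg_upper_bound(10, 4, partial), 2))   # 3.29, as in the example
```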
Intuition: Cell Coverage
• Cell coverage = area of intersection / area of cell
• "Good" coverage, e.g., a cell covered 50%: more or less enough information
• "Bad" coverage, e.g., a cell covered 25% (< 50%): too little information

Dynamic Synopsis Choice
• 50,000 x 50,000 array, different synopsis resolutions, query completion times:
  – 1000x1000 synopsis: large search space N/A, small search space 21m30s
  – 100x100 synopsis: large 4m41s, small 15m
  – 10x10 synopsis: large 2h28m, small 6m9s
  – Dynamic 1000-100-10 choice: large 3m, small 1m10s
• Solver times range from 6 s to 6 m

Dynamic Search Balancing
• Idle solvers report to the coordinator
• The coordinator dispatches helpers
  – It keeps a queue of busy solvers
  – Got help? Go to the end of the queue
  – Solvers may reject help (e.g., when they are finishing)
• Dynamic approach
  – Busy solvers might have several helpers
  – Helpers might have helpers

Individual solver times
(Figure: per-solver times for LSS-HS with static vs. dynamic balancing, for 8, 40, 100, and 500 slices.)

Candidate Zones
• For each candidate:
  1. Determine its chunks
  2. Put it into the zone containing most of its chunks
• Validator
  – Validates candidates from the same zone together
  – Recent zones first
(Figure: a data partition divided into zones 1-5; the CP Solver's candidates are binned into zones.)

Dynamic Candidates Forwarding
(Figure: a validator with 10,000 queued candidates forwards 5,000 of them to an idle validator on another node.)
• Candidate forwarding
  – Might cause data replication
  – Needed when validators are flooded
  – Only to idle validators
  – Forward the most recent candidates

(Figure: first result, average delay, and maximum delay for the SSS-LS, LSS-LS, LSS-HS, and LSS-ANO queries on 1, 2, 4, and 8 nodes; some first results arrive in under 1 second.)

MIMIC
• Contains waveforms from the ICU
• Two-dimensional array: (patient, time)
• Multiple signals: ABP, ECG, etc.
• Queries
  – Aggregate search (e.g., anomalies)
  – Sub-sequence matching (e.g., find a pattern similar to a query sequence)

Sub-sequence Matching
• Distance-based sub-sequence matching
  – Usually, sequences of DFT coefficients (traces) are indexed
  – Then, nearest-neighbor retrieval
• Applying Searchlight
  – The index is a synopsis
  – API call: distance between the current area and the query sequence
  – Expecting small overhead

Distributed Challenges
1. Search space partitioning
2. Data partitioning
3. Where to send candidates?
   – Solvers and validators might be disjoint
   – We don't know which data the validation needs

Simulating Validations
(Figure: the CP Solver simulates validation of its candidates; the Validator, Router, Data Array, and an Access Collector are involved.)
1. A candidate is submitted to the validator
2. The validator checks it on real data (via the router)
3. In simulation, the validator "checks" it on dumb data: every access returns (–inf, +inf)
4. The access collector writes down all accesses
5. Now we know the chunks!

Forwarding
1. Knowing the chunks, choose a validator
2. Forward the candidate to that validator (or keep it locally)
(See the sketch after the last slide below.)

Other Optimizations
• Solver/Validator balancing
  – Dynamically redistribute CPU between solvers and validators
  – Many candidates: more validators, and vice versa
  – Utilize idle times for validations
• Candidate relocation
  – Will cause data movement, so it is used rarely
  – Relocate only to idle validators
  – Try reusing validators
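A minimal Python sketch of the simulate-then-forward idea; all names, the chunk size, the shape of the UDF, and the chunk-to-validator map are assumptions for illustration, not Searchlight's API. The candidate's constraint function is evaluated against "dumb" data where every element is the interval (-inf, +inf), an access collector records which chunks the evaluation touches, and the candidate is then forwarded to the validator owning most of those chunks (or kept locally).

```python
import math
from collections import Counter

CHUNK = 1000   # hypothetical chunk size along each dimension

class DumbData:
    """'Dumb' data array: every element is unknown, i.e., (-inf, +inf).
    It doubles as the access collector, recording the chunks that get touched."""
    def __init__(self):
        self.chunks = set()

    def elem(self, x, y):
        self.chunks.add((x // CHUNK, y // CHUNK))
        return (-math.inf, math.inf)

def avg_br_udf(region, data):
    """Illustrative UDF: only the elements it reads matter here, since the
    simulated values are meaningless intervals."""
    r, d, rl, dl = region
    for x in range(r, r + rl):
        for y in range(d, d + dl):
            data.elem(x, y)

def simulate_validation(region, udf=avg_br_udf):
    """Run the UDF against dumb data; the result is the set of touched chunks."""
    data = DumbData()
    udf(region, data)
    return data.chunks

def choose_validator(chunks, chunk_owner):
    """Forward to the validator that owns most of the candidate's chunks;
    chunks without a known owner default to the local validator in this sketch."""
    owners = Counter(chunk_owner.get(c, "local") for c in chunks)
    return owners.most_common(1)[0][0]

# Usage: a candidate region (r, d, rl, dl) and a hypothetical chunk-to-validator map.
candidate = (1500, 900, 4, 1200)                  # touches chunks (1, 0), (1, 1), (1, 2)
touched = simulate_validation(candidate)
owners = {(1, 0): "validator-2", (1, 1): "validator-3", (1, 2): "validator-3"}
print(sorted(touched), "->", choose_validator(touched, owners))   # -> validator-3
```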