Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data

Alexander Kalinin, Ugur Cetintemel, Stan Zdonik
Interactive Data Exploration (IDE)
Searching for the “interesting” within big data
Where’s Waldo? Where’s the Horrible Gelatinous Blob?
• Exploratory analysis: ad hoc & repetitive
– Questions are not well defined
– “Interesting” can be complex
– Hard to find
– Hard to compute
– Fast, online results (human-in-the-loop)
Exploratory Queries: Some examples
• First-order
– “Celestial 3–5° by 5–7° regions with brightness > 0.8”
• Higher-order
– “Pairs of 2° by 2° celestial regions with similarity > 0.5”
• Optimized
– “Celestial 3° by 7° region with maximum brightness”
Sloan Digital Sky Survey (SDSS)
Sub-sequence Matching
Two Sides of Data Exploration
• Search complexity (the CP side)
– Search space is large: enumeration isn’t feasible
– Constraints are elaborate: more than just ranges
• Data complexity (the DBMS side)
– Large data sets (“big data”): hard to fit in memory
– Expensive computations: functions over a lot of objects
DBMSs for IDE?
• No native support for exploratory constructs
– No power set
– Limited support for user-defined logic
• Poor support for interactivity
“Celestial 3–5° by 5–7° regions with average brightness > 0.8” in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)
Data Exploration as a CP problem
“Celestial 3–5° by 5–7° regions with average brightness > 0.8”
Decision variables:
• Leftmost corner: r ∈ [100, 200], d ∈ [5, 40]
• Lengths: rl ∈ [3, 5], dl ∈ [5, 7]
Constraints:
• avg_br(r, d, rl, dl) > 0.8
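The semantics of this formulation can be sketched as a brute-force feasibility check in plain Python. The toy grid, the scaled-down domains, and the `avg_br` helper below are illustrative stand-ins, not Searchlight’s API — the point is only what the decision variables and the constraint mean.

```python
# Brute-force sketch of the CP formulation: enumerate all (r, d, rl, dl)
# assignments over a toy brightness grid and keep those satisfying
# avg_br(...) > 0.8. Domains are scaled down from the slide's
# [100, 200] x [5, 40] to keep the example tiny.
GRID = [
    [0.2, 0.9, 0.8, 0.1],
    [0.7, 1.0, 0.9, 0.3],
    [0.1, 0.6, 0.5, 0.2],
]

def avg_br(r, d, rl, dl):
    """Average brightness of the rl-by-dl region with corner (r, d)."""
    cells = [GRID[r + i][d + j] for i in range(rl) for j in range(dl)]
    return sum(cells) / len(cells)

def solve(threshold=0.8):
    results = []
    for r in range(len(GRID)):            # corner row
        for d in range(len(GRID[0])):     # corner column
            for rl in (1, 2):             # region height
                for dl in (1, 2):         # region width
                    if r + rl <= len(GRID) and d + dl <= len(GRID[0]) \
                            and avg_br(r, d, rl, dl) > threshold:
                        results.append((r, d, rl, dl))
    return results

print(solve())
```

A CP solver reaches the same answer set without enumerating every assignment, which is exactly what the following slides develop.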
CP Solvers
• Large variety of methods for exploring a search space
– Branch-and-Cut
– Large Neighborhood Search (LNS)
– Randomized search with Restarts
• Highly extensible – important for ad-hoc exploration!
– New constraints/functions
– New search heuristics
• But… comparing with DBMSs
– In-memory data (CP) vs. efficient disk data handling (DBMS)
– No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)
CP + DBMS for Data-Intensive Exploration
Exploring Alternatives
• 8-node cluster, 120 GB data, 2 GB memory
• Large search space, time-limited execution (1 hour):

Approach      First results, s   Subsequent delays, s
Searchlight   5                  6
CP            N/A                N/A
SciDB         N/A                N/A

• Small search space:

Approach      First result, s    Total time, s
Searchlight   4.8                5.13
CP            91                 304
SciDB         301.3              945.3
Dynamic Solve-Validate Approach
(Figure: within a SciDB instance, the CP Solver sends candidate solutions to the Validator; a Router connects both to the Synopsis Array and the Data Array.)
• CP Solver: runs the CP search process
– Synopsis-only access
– Produces candidate solutions
• Validator: validates candidates
– Data access
– Produces real solutions
– CP-based
• Router
– Provides uniform access to data
– CP Solver → Synopsis, Validator → Data
– Supports transparency
Synopsis Pyramid
(Figure: synopses of one sample array at 1×1, 2×2, and 4×4 resolutions. The 1×1 synopsis stores a single cell with Min: 1, Max: 5, Sum: 36, Count: 13; the 2×2 synopsis stores four cells with (Min, Max, Sum, Count) of (1, 4, 10, 4), (2, 5, 13, 4), (2, 4, 9, 3), and (1, 3, 4, 2).)
Allows approximate answers for data API calls:
• Lower and upper bounds, 100% confidence
Synopsis resolution trade-off:
• Coarse synopses are less accurate, but fast
• Fine synopses are very accurate, but slow
• Dynamic synopsis choice based on the query region
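One level of such a pyramid can be sketched in a few lines. The sample array, cell size, and function names below are illustrative, not Searchlight’s actual data structures; the sketch only shows how per-cell (min, max, sum, count) stats answer a point lookup with guaranteed bounds.

```python
# One level of a synopsis pyramid: the array is tiled into cells, and each
# cell keeps only (min, max, sum, count). elem_bounds() then answers a
# point lookup with guaranteed lower/upper bounds instead of exact data.
def build_synopsis(array, cell):
    """Tile `array` into cell x cell tiles of (min, max, sum, count) stats."""
    rows, cols = len(array), len(array[0])
    synopsis = {}
    for r0 in range(0, rows, cell):
        for c0 in range(0, cols, cell):
            vals = [array[r][c]
                    for r in range(r0, min(r0 + cell, rows))
                    for c in range(c0, min(c0 + cell, cols))]
            synopsis[(r0 // cell, c0 // cell)] = (
                min(vals), max(vals), sum(vals), len(vals))
    return synopsis

def elem_bounds(synopsis, cell, r, c):
    """Bounds [lo, hi] for array[r][c], with 100% confidence."""
    lo, hi, _, _ = synopsis[(r // cell, c // cell)]
    return lo, hi

ARRAY = [
    [1, 2, 5, 3],
    [3, 4, 3, 2],
    [2, 4, 1, 3],
]
syn = build_synopsis(ARRAY, 2)
print(elem_bounds(syn, 2, 0, 0))   # -> (1, 4): the element lies in its cell's [min, max]
```

A coarser level is just the same construction with a larger `cell`, which is the resolution trade-off described above.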
Distributed Searchlight in a Nutshell
(Figure: a layer of CP Solvers above a layer of Validators.)
• Layer of CP Solvers
– Search balancing
– Multiple solvers per instance, depending on free CPU cores
• Layer of Validators
– Data partitioning
– Multiple validators per instance, depending on free CPU cores
• Disjoint layers
– Different number of processes
– No mandatory collocation
– Dynamic allocation
Static Search Partitioning
(Figure: a two-dimensional search space divided into numbered slices along one dimension.)
• Two-dimensional search space
– Variables: x, y
– Interval domains
• Search partitions
– Divide the intervals; each solver gets a slice
• Features
– Works with any heuristic
– Covers hot-spots
Dynamic Search Balancing
Busy solver (x = [0, 100], split into x = [0, 50] and x = [50, 100]):
1. Go to [0, 50]
2. Help available! Push [0, 50] to the helper
3. Go to [50, 100]
Idle solver:
1. Idle!
2. Got [0, 50]
3. Explore it as its own search partition
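The hand-off above can be sketched as a sequential simulation; the solver names, the queue of idle helpers, and the trivial `explore` stand-in are illustrative, not Searchlight’s protocol.

```python
# Dynamic search balancing, simulated sequentially: a busy solver that is
# about to descend into a sub-interval can instead push it to an idle
# helper, then continue with the remaining sub-interval itself.
from collections import deque

def split(lo, hi):
    """Split an interval in half, e.g. [0, 100] -> [0, 50], [50, 100]."""
    mid = (lo + hi) // 2
    return (lo, mid), (mid, hi)

def explore(interval, out):
    out.append(interval)          # stand-in for real CP search on the slice

def run(root=(0, 100)):
    idle_helpers = deque(["helper-1"])    # helpers that reported as idle
    done = {"busy": [], "helper-1": []}
    left, right = split(*root)
    if idle_helpers:                      # help available: push the left half
        helper = idle_helpers.popleft()
        explore(left, done[helper])       # helper treats it as its own partition
    else:
        explore(left, done["busy"])
    explore(right, done["busy"])          # busy solver continues with [50, 100]
    return done

print(run())   # -> {'busy': [(50, 100)], 'helper-1': [(0, 50)]}
```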
Data Partitioning
(Figure: the data array split into partitions, one per validator.)
• No data prefetching
– Fetch only when needed (i.e., on validations)
• Data transfer is transparent to validators
Other Optimizations
• Using synopses for validations
– Query region must be aligned with the grid
• Dividing data partitions into zones
– Avoids thrashing
– Validate candidates from recent zones first
• Solver/Validator balancing
– Dynamically redistribute CPU between Solvers and Validators
– Many candidates → more Validators; and vice versa
– Utilize idle times for validations
SDSS Results
• Google’s OR-Tools + SciDB
• 80 GB SDSS data
• Varying selectivity: grid size, region size, magnitudes

Query   First result, s   Min/avg/max delays, s   Total, s
Q1      10                0.001/2/54              300
Q2      17                17                      132
Q3      24                0.004/6/45              331
Q4      29                0.21/13/29              134
Related Work
• PackageBuilder (UMass Amherst & NYU Abu Dhabi)
• Sets of tuples with global constraints
• Pruning, local search, MIP
• Constraint Programming
• Solvers, parallel search, heuristics,…
• DBMSs & Spatial DBMSs
• “Simple” retrieval queries
• Content-Based Retrieval (CBR)
Ongoing Work
• Query planning for search queries
– Higher-order queries
– DBMS integration (e.g., push-down predicates)
• Exploring new datasets/constraints
– MIMIC dataset
– Sub-sequence matching
Thank you!
Questions?
Search process for a backtracking CP solver
(Figure: search tree. The root has ra = [100, 200], dec = [5, 40]. The solver splits ra into [100, 132], [133, 165], and [166, 200], descends into ra = [133, 165], dec = [5, 40], and leaves ra = [100, 132] ∪ [166, 200], dec = [5, 40] for later. It then splits dec into [5, 16], [17, 28], and [29, 40]; the branch ra = [133, 165], dec = [5, 28] fails, so the solver takes ra = [133, 165], dec = [29, 40], splits it into ra = 133, dec = [29, 40] and ra = [134, 165], dec = [29, 40], and continues down to single assignments such as ra = 133, dec = 29.)
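The split-and-backtrack process above can be sketched as a tiny solver over interval domains. The domains and the sum-based bound check below stand in for the real avg constraint and are purely illustrative.

```python
# Backtracking over interval domains: split the widest variable's interval,
# prune sub-trees the bound check rules out ("Fail!"), and emit fully
# instantiated assignments at the leaves.
def backtrack(domains, feasible, out):
    if not feasible(domains):
        return                                  # Fail! -> backtrack
    if all(lo == hi for lo, hi in domains.values()):
        out.append({v: lo for v, (lo, hi) in domains.items()})
        return
    # pick the widest unbound variable, e.g. ra = [100, 200]
    var = max(domains, key=lambda v: domains[v][1] - domains[v][0])
    lo, hi = domains[var]
    mid = (lo + hi) // 2
    for part in ((lo, mid), (mid + 1, hi)):     # two-way split of the domain
        child = dict(domains, **{var: part})
        backtrack(child, feasible, out)

# Illustrative stand-in for the brightness constraint: accept ra + dec == 12.
# The relaxation checks whether 12 is inside the achievable sum interval.
def feasible(domains):
    lo = sum(l for l, h in domains.values())
    hi = sum(h for l, h in domains.values())
    return lo <= 12 <= hi

out = []
backtrack({"ra": (4, 7), "dec": (5, 8)}, feasible, out)
print(out)   # every (ra, dec) with ra + dec == 12
```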
“Celestial 3–5° by 5–7° regions with average brightness > 2”
Decision variables:
• r ∈ [100, 200], d ∈ [5, 40]
• rl ∈ [3, 5], dl ∈ [5, 7]
Constraints:
• avg(r, d, rl, dl) > 2
• r + rl – 1 <= 200
• d + dl – 1 <= 40
CP “UDFs”:
• z = avg(…), z ∈ (2, +inf)
• Accesses the data
• Provides min/max values
UDF → Searchlight API calls:
• aggregate(X1, X2)
• elem(X)
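The role of the avg “UDF” can be sketched as a three-way check against the [m, M] interval returned by an `aggregate()` API call; the function name and the bound values below are illustrative.

```python
# What the avg "UDF" does with synopsis bounds: compare the interval [m, M]
# returned for the query region against the constraint avg(...) > threshold,
# and report fail / satisfied / candidate.
def check_avg_constraint(m, M, threshold):
    """m, M: guaranteed lower/upper bounds on avg over the query region."""
    if M <= threshold:
        return "fail"        # no region here can satisfy avg > threshold
    if m > threshold:
        return "satisfied"   # every region satisfies it: no data access needed
    return "candidate"       # undecided: send to a validator for real data

assert check_avg_constraint(0.1, 1.5, 2.0) == "fail"
assert check_avg_constraint(2.3, 3.0, 2.0) == "satisfied"
assert check_avg_constraint(1.5, 2.5, 2.0) == "candidate"
```

Only the "candidate" outcome triggers real data access, which is why the solver can run on the synopsis alone.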
(Figure: the sample array next to its 2×2 synopsis cells — (1, 4, 10, 4), (2, 5, 13, 4), (2, 4, 9, 3), (1, 3, 4, 2) — and two possible reconstructions labeled “Upper Bound” and “Lower Bound”.)
Synopsis is lossy compression:
• The top-right cell holds the values (5, 3, 3, 2)
• The synopsis keeps only Min = 2, Max = 5, Sum = 13, Count = 4
• The cell’s distribution is lost. Is it (5, 2, 4, 2)? (2, 5, 2, 4)? (5, 3, 3, 2)?
Synopsis answers API calls with bounds:
• elem(0, 0) → [1, 4]
• avg(white region) → [m, M]
– m – lower bound
– M – upper bound
Upper Bound Example
(Figure: a query region covering one synopsis cell fully and three cells partially.)
• Full cells: a = 10/4 = 2.5
• Partial cells; (value, count) pairs from cell stats:
– (5, 1), (4, 1), (2, 2)
– (4, 1), (3, 1), (2, 1)
– (3, 1), (1, 1)
• Add in descending order:
1. 10/4 + (5, 1) = 15/5 = 3
2. 15/5 + (4, 1) = 19/6 = 3.17
3. 19/6 + (4, 1) = 23/7 = 3.29
4. Stop!
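The steps above follow one greedy rule: add partial-cell values in descending order while the running average still increases. A minimal sketch, using the slide’s numbers:

```python
# Upper bound on the average over a query region: start from the exact
# sum/count of fully covered cells, then greedily add (value, count) pairs
# from partially covered cells in descending value order while the running
# average keeps increasing. Once a value no longer raises the average, no
# smaller value can, so we stop.
def avg_upper_bound(full_sum, full_count, partial_pairs):
    pairs = sorted(partial_pairs, reverse=True)     # descending by value
    s, c = full_sum, full_count
    for value, count in pairs:
        for _ in range(count):                      # add one element at a time
            if value > s / c:                       # adding raises the average
                s, c = s + value, c + 1
            else:
                return s / c                        # Stop!
    return s / c

# Numbers from the slide: full cells give 10/4; the partial cells contribute
# pairs (5,1), (4,1), (2,2), (4,1), (3,1), (2,1), (3,1), (1,1).
bound = avg_upper_bound(10, 4, [(5, 1), (4, 1), (2, 2),
                                (4, 1), (3, 1), (2, 1),
                                (3, 1), (1, 1)])
print(round(bound, 2))   # -> 3.29, matching 23/7 on the slide
```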
Intuition: Cell Coverage
Cell coverage = area of intersection / area of cell
(Figure: a query region overlapping synopsis cells to different degrees.)
• “Good” coverage: cell is covered 50%
– More or less enough information
• “Bad” coverage: cell is covered 25% (< 50%)
– Too little information
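Coverage is plain rectangle intersection; a minimal sketch (the coordinate convention is illustrative):

```python
# Cell coverage = area of (query ∩ cell) / area of cell. Rectangles are
# (row_lo, col_lo, row_hi, col_hi) with exclusive upper bounds.
def coverage(query, cell):
    qr0, qc0, qr1, qc1 = query
    cr0, cc0, cr1, cc1 = cell
    rows = max(0, min(qr1, cr1) - max(qr0, cr0))
    cols = max(0, min(qc1, cc1) - max(qc0, cc0))
    cell_area = (cr1 - cr0) * (cc1 - cc0)
    return rows * cols / cell_area

# A 2x2 cell half inside the query -> "good" coverage (50%)
print(coverage((0, 0, 3, 3), (2, 0, 4, 2)))   # -> 0.5
# A 2x2 cell with only one element inside -> "bad" coverage (25%)
print(coverage((0, 0, 3, 3), (2, 2, 4, 4)))   # -> 0.25
```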
Dynamic Synopsis Choice
• 50,000 × 50,000 array
• Different synopsis resolutions
• Query completion times:

Search space   1000x1000   100x100   10x10    1000-100-10
Large          N/A         4m41s     2h28m    3m
Small          21m30s      15m       6m9s     1m10s

Solver times: 6s → 6m
Dynamic Search Balancing
• Idle solvers have to report to the coordinator
• Coordinator dispatches helpers
– Queue of busy solvers
– Got help? Go to the end of the queue
– Solvers may reject help (e.g., they’re finishing)
• Dynamic approach
– Busy solvers might have several helpers
– Helpers might have helpers
Individual solver times
(Figure: per-solver times in seconds for LSS-HS under static and dynamic balancing, for 8, 40, 100, and 500 slices; the static axis runs to 500 s, the dynamic axis to 300 s.)
Candidate Zones
For each candidate:
1. Determine its chunks
2. Put it into the zone with the most of its chunks
(Figure: a CP Solver’s data partition divided into Zones 1–5.)
Validator:
• Validates candidates from the same zone together
• Recent zones first
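The assignment step can be sketched as a counting problem; the chunk-to-zone map below is an illustrative partition layout, not Searchlight’s.

```python
# Put each candidate into the zone holding most of its chunks, so the
# validator can drain candidates zone by zone and avoid thrashing.
from collections import Counter

def assign_zone(candidate_chunks, chunk_to_zone):
    """Pick the zone that owns the majority of the candidate's chunks."""
    votes = Counter(chunk_to_zone[c] for c in candidate_chunks)
    return votes.most_common(1)[0][0]

chunk_to_zone = {0: 1, 1: 1, 2: 2, 3: 2, 4: 3}   # illustrative layout
print(assign_zone([0, 1, 2], chunk_to_zone))      # -> 1 (two chunks in zone 1)
print(assign_zone([2, 3, 4], chunk_to_zone))      # -> 2 (two chunks in zone 2)
```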
Dynamic Candidates Forwarding
(Figure: the validator on Node 1 holds 10,000 candidates; 5,000 of them are forwarded to the validator on Node 2.)
Candidates forwarding:
• Might cause data replication
• Needed when validators are flooded
• Only to idle validators
• Forward to recent
(Figure: first result, average delay, and maximum delay, in seconds, for queries SSS-LS, LSS-LS, LSS-HS, and LSS-ANO on 1, 2, 4, and 8 nodes; some SSS-LS and LSS-ANO values are marked “< 1 sec”.)
MIMIC
• Contains waveforms from the ICU
• Two-dimensional array: (patient, time)
• Multiple signals: ABP, ECG, etc.
• Queries
– Aggregate search (e.g., anomalies)
– Sub-sequence matching (e.g., find a pattern similar to a query)
Sub-sequence Matching
• Sub-sequence matching
– Distance-based
– Usually, a sequence of DFT coefficients traces an index
– Then, nearest-neighbor retrieval
• Applying Searchlight
– The index is a synopsis
– API call: distance between the current area and the query sequence
– Expecting small overhead
Distributed Challenges
1. Search space partitioning
2. Data partitioning
3. Where to send candidates?
– Solvers/validators might be disjoint
– We don’t know the data the validation needs
Simulating Validations
(Figure: the CP Solver submits candidates to the Validator; a Router provides access to the Data Array, and an Access Collector records data accesses during simulation.)
1. A candidate is submitted to the validator
2. The validator checks it on real data (via the router)
3. The validator “checks” it on dumb data: (–inf, +inf)
4. The access collector writes down all accesses
5. Now we know the chunks!
Forwarding:
1. Knowing the chunks, choose a validator
2. Forward the candidate to the validator (or keep it)
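The “dumb data” trick can be sketched with a recording data source. The chunk size, the check function, and the validator-ownership map below are illustrative; the point is that a data source answering every read with (–inf, +inf) never fails a check, so the full access pattern gets recorded.

```python
# Simulating a validation: run the candidate's check against a data source
# that answers every read with (-inf, +inf) but records which chunks were
# touched. The recorded chunk set tells us which validator to forward to.
import math
from collections import Counter

class AccessCollector:
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = set()
    def elem(self, x):
        self.chunks.add(x // self.chunk_size)   # write down the access
        return (-math.inf, math.inf)            # dumb data: never fails a check

def check_candidate(data, lo, hi, threshold):
    """Illustrative check: is the max over [lo, hi) above threshold?"""
    upper = max(data.elem(x)[1] for x in range(lo, hi))
    return upper > threshold

collector = AccessCollector(chunk_size=10)
check_candidate(collector, 18, 35, 0.8)         # simulated validation
print(sorted(collector.chunks))                  # -> [1, 2, 3]

def choose_validator(chunks, owner):
    """Forward to the validator owning most of the touched chunks."""
    return Counter(owner[c] for c in chunks).most_common(1)[0][0]

print(choose_validator(collector.chunks,
                       {1: "validator-A", 2: "validator-B", 3: "validator-B"}))
# -> validator-B
```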
Other Optimizations
• Solver/Validator balancing
– Dynamically redistribute CPU between Solvers and Validators
– Many candidates → more Validators; and vice versa
– Utilize idle times for validations
• Candidate relocation
– Will cause data movement – used rarely
– Relocate only to idle validators
– Try reusing validators