Database/Network Convergence
Adaptive Dataflow:
A Database/Networking
Cosmic Convergence
Joe Hellerstein
UC Berkeley
Road Map
How I got started on this
CONTROL project
Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow
research
New arenas:
Sensor networks
P2P networks
Background: CONTROL project
Online/Interactive query processing
Online aggregation
Scalable spreadsheets & refining
visualizations
Online data cleaning (Potter’s Wheel)
Pipelining operators (ripple joins,
online reordering) over streaming
samples
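The online aggregation idea above can be sketched in a few lines: maintain a running estimate and a shrinking confidence interval as sampled tuples stream in. This is an illustrative sketch (Welford's method plus a CLT interval), not the CONTROL code; all names are hypothetical.

```python
import math

class OnlineAvg:
    """Running mean over streamed samples, with a CLT-based ~95%
    confidence interval that tightens as more tuples arrive
    (Welford's online variance update)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def interval(self, z=1.96):
        # half-width of the ~95% confidence interval around the mean
        if self.n < 2:
            return float('inf')
        var = self.m2 / (self.n - 1)
        return z * math.sqrt(var / self.n)
```

A UI can display `mean ± interval()` after every batch of samples, which is exactly the kind of early, refining answer the slide describes.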
Example: Online Aggregation
Online Data Visualization
CLOUDS
Potter’s Wheel
Goals for Online Processing
Performance metric:
Statistical (e.g. conf. intervals)
User-driven (e.g. weighted by widgets)
New “greedy” performance regime
Maximize 1st derivative of the “mirth index”
Mirth defined on-the-fly
Therefore need FEEDBACK and CONTROL
[Chart: % of query complete vs. time; the online approach delivers useful answers early, while the traditional query returns nothing until it hits 100% at the end]
CONTROL Volatility
Goals and data may change over time
User feedback, sample variance
Goals and data may be different in different
“regions”
Group-by, scrollbar position
[An aside: dependencies in selectivity estimation]
Q: Query optimization in this world?
Or in any pipelining, volatile environment??
Where else do we see volatility?
Continuous Adaptivity: Eddies
Eddy
A little more state per tuple
Ready/done bits (extensible a la
Volcano/Starburst)
Query processing = dataflow routing!!
We'll come back to this!
Eddies: Two Key Observations
Break the set-oriented boundary
Usual DB model: algebra expressions: (R ⋈ S) ⋈ T
Usual DB implementation: pipelining operators!
Subexpressions never materialized
Typical implementation is more flexible than algebra
We can reorder in-flight operators
Other gains possible by breaking the set-oriented
boundary…
Don’t rewrite graph. Impose a router
Graph edge = absence of routing constraint
Observe operator consumption/production rates
Consumption: cost
Production: cost*selectivity
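The routing idea above can be sketched with per-tuple done bits and a lottery-scheduling policy that rewards operators observed to be cheap and selective. This is a much-simplified illustration (filter operators only, no joins); the operator names and reward rule are hypothetical, not the eddies implementation.

```python
import random

class Eddy:
    """Minimal eddy sketch: route each tuple through a set of filter
    operators in an adaptively chosen order. Per-tuple 'done' bits
    record which operators have seen the tuple; lottery tickets favor
    operators that drop tuples early (high consumption, low production)."""
    def __init__(self, ops):
        self.ops = ops                       # name -> predicate
        self.tickets = {name: 1 for name in ops}

    def route(self, tup):
        done = set()                         # per-tuple done bits
        while len(done) < len(self.ops):
            ready = [n for n in self.ops if n not in done]
            weights = [self.tickets[n] for n in ready]
            op = random.choices(ready, weights=weights)[0]
            done.add(op)
            if not self.ops[op](tup):
                self.tickets[op] += 1        # selective op: reward it
                return None                  # tuple dropped
        return tup                           # passed every operator
```

Query processing really is dataflow routing here: the plan is just the set of operators plus the router's (learned) ticket counts.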
Road Map
How I got started on this
CONTROL project
Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow
research
New arenas:
Sensor networks
P2P networks
Coincidence: Eddie Comes to
Berkeley
CLICK: a NW router is a query plan!
“The Click Modular Router”, Robert Morris, Eddie Kohler,
John Jannotti, and M. Frans Kaashoek, SOSP ‘99
Also Scout
Paths the key to comm-centric OS
“Making Paths Explicit in the Scout Operating System”,
David Mosberger and Larry L. Peterson. OSDI ‘96.
[Figure 3: Example Router Graph]
More Interaction: CS262
Experiment w/ Eric Brewer
Merge OS & DBMS grad class, over a year
Eric/Joe, point/counterpoint
Some tie-ins were obvious:
memory mgmt, storage, scheduling, concurrency
Surprising: QP and networks go well side by
side
E.g. eddies and TCP Congestion Control
Both use back-pressure and simple Control Theory to
“learn” in an unpredictable dataflow environment
Eddies close to the n-armed bandit problem
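The TCP side of the analogy is easy to make concrete: additive-increase/multiplicative-decrease is the "simple Control Theory" rule that lets a sender learn a safe rate in an unpredictable environment. A toy sketch (not tied to any real network stack):

```python
def aimd(loss_events, cwnd=1.0, incr=1.0, decr=0.5):
    """Additive-increase/multiplicative-decrease, the control rule in
    TCP congestion avoidance: grow the window by `incr` each round
    with no loss, cut it by `decr` on a loss signal. `loss_events`
    is one boolean per round; returns the window trajectory."""
    trace = []
    for lost in loss_events:
        cwnd = cwnd * decr if lost else cwnd + incr
        trace.append(cwnd)
    return trace
```

An eddy's ticket policy plays the same game: probe, observe feedback, and adjust, with no model of the environment assumed in advance.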
Networking Overview for DB
People Like Me
Core function of protocols: data xfer
Data Manipulation (buffer, checksum, encryption,
xfer to/fr app space, presentation)
Transfer Control (flow/congestion ctl, detecting
xmission probs, acks, muxing, timestamps,
framing)
-- Clark & Tennenhouse, “Architectural Considerations for a
New Generation of Protocols”, SIGCOMM ‘90
Basic Internet assumption:
“a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)
(Sounds like query optimization!)
C & T’s Wacky Ideas
Thesis: nets are good at xfer control, not so
good at data manipulation
Some C&T wacky ideas for better data manipulation:
Xfer semantic units, not packets (ALF) — sounds like Data Modeling!
Auto-rewrite layers to flatten them (ILP)
Minimize cross-layer ordering constraints
Control delivery in parallel via packet content — sounds like Exchange!
Wacky New Ideas in QP
What if…
We had unbounded data producers and consumers
(“streams” … “continuous queries”)
We couldn’t know our producers’ behavior or contents??
(“federation” … “mediators”)
We couldn’t predict user behavior? (“control”)
We couldn’t predict behavior of components in the
dataflow? (“networked services”)
We had partial failure as a given? (oops, have we ignored
this?)
Yes … networking people have been here!
Remember Van Jacobson’s quote?
The Cosmic Convergence
[Diagram: DATABASE RESEARCH (Data Models, Query Opt, Data Scalability) and NETWORKING RESEARCH (Adaptivity, Federated Control, Geo-Scalability) converging on a shared middle ground: Adaptive Query Processing, Continuous Queries, Approximate/Interactive QP, and Sensor Databases on the DB side; Content-Based Routing, Router Toolkits, Content Addressable Networks, and Directed Diffusion on the networking side]
The Cosmic Convergence
[Same diagram, with Telegraph placed at the center of the convergence]
Road Map
How I got started on this
CONTROL project
Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow
research
New arenas:
Sensor networks
P2P networks
What’s in the Sweet Spot?
Scenarios with:
Structured Content
Volatility
Rich Queries
Clearly:
Long-running data analysis a la CONTROL
Continuous queries
Queries over Internet sources and services
Two emerging scenarios:
Sensor networks
P2P query processing
Telegraph: Engineering the
Sweet Spot
An adaptive dataflow system
Dataflow programming model
A la Volcano, CLICK: push and pull. “Fjords”, ICDE02
Extensible set of pipelining operators, including
relational ops, grouped filters (e.g. XFilter)
SQL parser for convenience (looking at XQuery)
Adaptivity operators
Eddies
+ Extensible rules for routing constraints, Competition
SteMs (state modules)
FLuX (Fault-tolerant Load-balancing eXchange)
Bounded and continuous:
Data sources
Queries
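The push-and-pull dataflow edges mentioned above can be sketched as a queue whose two ends are driven independently: a source pushes when data arrives, a consumer pulls when it wants a tuple, and a pull on an empty queue returns nothing rather than blocking. This is a much-simplified reading of the Fjords idea cited above, not the Telegraph code.

```python
from collections import deque

class Fjord:
    """Sketch of a Fjord-style edge between two operators: push-driven
    by the producer (e.g. a streaming sensor), pull-driven by the
    consumer, and non-blocking so a stalled source never stalls the
    rest of the plan."""
    def __init__(self):
        self.q = deque()

    def push(self, tup):
        # producer side: called whenever the source has data
        self.q.append(tup)

    def pull(self):
        # consumer side: returns None instead of blocking when empty
        return self.q.popleft() if self.q else None
```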
State Modules (SteMs)
Goal: Further adaptivity through competition
Multiple mirrored sources
Handle rate changes, failures, parallelism
Multiple alternate operators
Join = Routing + State
SteM operator manages tradeoffs
State Module: unifies caches, rendezvous buffers, join state
Competitive sources/operators share building/probing SteMs
Join algorithm hybridization!
[Diagrams: static dataflow vs. eddy vs. eddy + SteMs]
Vijayshankar Raman
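"Join = Routing + State" can be illustrated with a symmetric (doubly pipelined) hash join built from two SteM-like state modules: each arriving tuple is built into its own side's SteM and probed against the other. A sketch of the concept, not the Telegraph implementation; names are illustrative.

```python
from collections import defaultdict

class SteM:
    """One half of a symmetric hash join: shared state that tuples
    are built into and probed against, independent of any fixed
    join-operator pairing."""
    def __init__(self, key):
        self.key = key                       # join-key extractor
        self.table = defaultdict(list)

    def build(self, tup):
        self.table[self.key(tup)].append(tup)

    def probe(self, k):
        return self.table.get(k, [])

def stem_join(stream, stem_r, stem_s):
    """`stream` yields ('R', tuple) / ('S', tuple) in any interleaving;
    the join is just routing each tuple to build one SteM and probe
    the other."""
    out = []
    for side, tup in stream:
        mine, other = (stem_r, stem_s) if side == 'R' else (stem_s, stem_r)
        mine.build(tup)
        for match in other.probe(mine.key(tup)):
            out.append((tup, match) if side == 'R' else (match, tup))
    return out
```

Because the state lives in the SteMs rather than inside a join operator, competing operators (or mirrored sources) can share it, which is what enables the hybridization the slide mentions.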
FLuX: Routing Across Cluster
Fault Tolerance, Load Balancing
Continuous/long-running flows need high availability
Big flows need parallelism
Adaptive Load-Balancing req’d
FLuX operator: Exchange plus…
Adaptive flow partitioning (River)
Transient state replication & migration
Needs to be extensible to different ops:
Content-sensitivity
History-sensitivity
Dataflow semantics
RAID for SteMs
Optimize based on edge semantics
Networking tie-in again:
At-least-once delivery?
Exactly-once delivery?
In/out of order?
Migration policy: the ski rental analogy
Mehul Shah
Continuously Adaptive
Continuous Queries (CACQ)
Continuous Queries clearly need all this stuff! Address
adaptivity 1st.
4 Ideas in CACQ:
Use eddies to allow reordering of ops.
Explicit tuple lineage
Mark each tuple with per-op ready/done bits
Mark each tuple with per-query completed bits
Queries are data: join with Grouped Filter
But one eddy will serve for all queries
Much like XFilter, but for relational queries
Joins via SteMs, shared across all queries
Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared
algebraic expressions!
Delete a tuple from flow only if it matches no query
Next: F.T. CACQ via FLuXen
Sam Madden, Mehul Shah, Vijayshankar Raman
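The "queries are data" idea above can be sketched as a grouped filter: many continuous queries with predicates over one attribute share a single structure, so each tuple is matched against all queries at once and is dropped only if it matches none. This is an illustrative simplification (range predicates, linear scan), not the CACQ operator.

```python
class GroupedFilter:
    """Sketch of a CACQ-style grouped filter: queries are data.
    Each query registers a (lo, hi) range over one attribute; a
    tuple's value is matched against every registered query in a
    single pass, yielding its per-query 'completed' set."""
    def __init__(self):
        self.queries = []                    # list of (lo, hi, qid)

    def add_query(self, qid, lo, hi):
        self.queries.append((lo, hi, qid))

    def match(self, value):
        # linear scan for clarity; a real grouped filter would index
        # the predicate bounds rather than scan them
        return {qid for lo, hi, qid in self.queries if lo <= value <= hi}
```

A tuple whose `match()` set is empty can be deleted from the flow, which is exactly the "delete only if it matches no query" rule above.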
Road Map
How I got started on this
CONTROL project
Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow
research
New arenas:
Sensor networks
P2P networks
Sensor Nets
“Smart Dust” + TinyOS
Thousands of “motes”
Expensive communication
Power constraints
Query workload:
Aggregation & approximation
Queries and Continuous Queries
Simple example:
Aggregation query
Challenges:
Push the processing into the network
Deal with volatility & failure
CONTROL issues: data variance, user desires
Joint work with Ramesh Govindan, Sam Madden,
Wei Hong and David Culler (Intel Berkeley Lab)
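"Push the processing into the network" can be sketched for the aggregation example: each mote merges its children's partial (sum, count) states with its own reading and forwards one small record upward, instead of shipping every raw reading to the root. The tree topology and readings here are hypothetical.

```python
def in_network_avg(tree, readings, node):
    """In-network aggregation sketch for AVG over a sensor tree:
    each node returns a partial (sum, count) combining its own
    reading with its children's partials, so only one small record
    crosses each (expensive) radio link."""
    s, c = readings[node], 1
    for child in tree.get(node, []):
        cs, cc = in_network_avg(tree, readings, child)
        s, c = s + cs, c + cc
    return s, c
```

The root divides sum by count to produce the answer; the same merge-partials pattern covers COUNT, SUM, MIN, and MAX.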
P2P QP
Starting point: P2P as grassroots phenomenon
Outrageous filesharing volume (1.8G files in October 2001)
No business case to date
Challenge: scale DDBMS QP ideas to P2P
Motivate why
Pick the right parts of DBMS research to focus on
Storage: no! QP: yes.
Make it work:
Scalability well beyond our usual target
Admin constraints
Unknown data distributions, load
Heterogeneous comm/processing
Partial failure
Joint work with Scott Shenker, Ion Stoica, Matt
Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo
A Grassroots Example: TeleNap
Themes Throughout
Adaptivity
Requires clever system design
Interesting adaptive policy problems
The Exchange model: encapsulate in ops?
E.g. eddy routing, flux migration
Control Theory, Machine Learning
Encompasses another CS goal?
“No-knobs”, “Autonomic”, etc.
New performance regimes
Decent performance in the common case
Mean/Variance more important than MAX
Interactive Metrics
Time to completion often unimportant/irrelevant
More Themes
Set-valued thinking as albatross?
E.g. eddies vs. Kabra/DeWitt or Tukwila
E.g. SteMs vs. Materialized Views
E.g. CACQ vs. NiagaraCQ
Some clean theory here would be nice
Current routing correctness proofs are inelegant
Extensibility
Model/language of choice is not clear
SEQ? Relational? XQuery?
Extensible operators, edge semantics
[A whine about VLDB’s absurd “Specificity
Factor”]
Conclusions?
Too early for technical conclusions
Of this I’m sure:
The CS262 experiment is a success
Nets folks are coming our way
Our students are getting a bigger picture than before
I’m learning, finding new connections
May morph to OS/Nets, Nets/DB
Eventually rethink the systems software curriculum at
the undergraduate level too
Doing relevant work, eager to collaborate
DB community needs to branch out
Outbound: Better proselytizing in CS
Inbound: Need new ideas
Conclusions, cont.
Sabbatical is a good invention
Hasn’t even started, and I’m already grateful!