Database/Network Convergence
Adaptive Dataflow:
A Database/Networking
Cosmic Convergence
Joe Hellerstein
UC Berkeley
Road Map
How I got started on this
CONTROL project
Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow
research
New arenas:
Sensor networks
P2P networks
Background: CONTROL project
Online/Interactive query processing
Online aggregation
Scalable spreadsheets & refining
visualizations
Online data cleaning (Potter’s Wheel)
Pipelining operators (ripple joins,
online reordering) over streaming
samples
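The online aggregation idea above can be sketched in a few lines: maintain a running estimate and a shrinking confidence interval as sampled tuples stream in. This is an illustrative sketch (Welford's method plus a CLT interval), not the CONTROL code; all names are hypothetical.

```python
import math

class OnlineAvg:
    """Running mean over streamed samples, with a CLT-based ~95%
    confidence interval that tightens as more tuples arrive
    (Welford's online variance update)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def interval(self, z=1.96):
        # half-width of the ~95% confidence interval around the mean
        if self.n < 2:
            return float('inf')
        var = self.m2 / (self.n - 1)
        return z * math.sqrt(var / self.n)
```

A UI can display `mean ± interval()` after every batch of samples, which is exactly the kind of early, refining answer the slide describes.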
Example: Online Aggregation
Online Data Visualization
CLOUDS
Potter’s Wheel
Goals for Online Processing
Performance metric:
Statistical (e.g. conf. intervals)
User-driven (e.g. weighted by widgets)
New “greedy” performance regime
Maximize 1st derivative of the “mirth index”
Mirth defined on-the-fly
Therefore need FEEDBACK and CONTROL
[Chart: % of query complete vs. time; the online approach delivers useful answers early, while the traditional query returns nothing until it hits 100% at the end]
CONTROL Volatility
Goals and data may change over time
User feedback, sample variance
Goals and data may be different in different
“regions”
Group-by, scrollbar position
[An aside: dependencies in selectivity estimation]
Q: Query optimization in this world?
Or in any pipelining, volatile environment??
Where else do we see volatility?
Continuous Adaptivity: Eddies
Eddy
A little more state per tuple
Ready/done bits (extensible a la
Volcano/Starburst)
Query processing = dataflow routing!!
We'll come back to this!
Eddies: Two Key Observations
Break the set-oriented boundary
Usual DB model: algebra expressions: (R ⋈ S) ⋈ T
Usual DB implementation: pipelining operators!
Subexpressions never materialized
Typical implementation is more flexible than algebra
We can reorder in-flight operators
Other gains possible by breaking the set-oriented
boundary…
Don’t rewrite graph. Impose a router
Graph edge = absence of routing constraint
Observe operator consumption/production rates
Consumption: cost
Production: cost*selectivity
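The routing idea above can be sketched with per-tuple done bits and a lottery-scheduling policy that rewards operators observed to be cheap and selective. This is a much-simplified illustration (filter operators only, no joins); the operator names and reward rule are hypothetical, not the eddies implementation.

```python
import random

class Eddy:
    """Minimal eddy sketch: route each tuple through a set of filter
    operators in an adaptively chosen order. Per-tuple 'done' bits
    record which operators have seen the tuple; lottery tickets favor
    operators that drop tuples early (high consumption, low production)."""
    def __init__(self, ops):
        self.ops = ops                       # name -> predicate
        self.tickets = {name: 1 for name in ops}

    def route(self, tup):
        done = set()                         # per-tuple done bits
        while len(done) < len(self.ops):
            ready = [n for n in self.ops if n not in done]
            weights = [self.tickets[n] for n in ready]
            op = random.choices(ready, weights=weights)[0]
            done.add(op)
            if not self.ops[op](tup):
                self.tickets[op] += 1        # selective op: reward it
                return None                  # tuple dropped
        return tup                           # passed every operator
```

Query processing really is dataflow routing here: the plan is just the set of operators plus the router's (learned) ticket counts.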
Road Map
How I got started on this
CONTROL project
Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow
research
New arenas:
Sensor networks
P2P networks
Coincidence: Eddie Comes to
Berkeley
CLICK: a NW router is a query plan!
“The Click Modular Router”, Robert Morris, Eddie Kohler,
John Jannotti, and M. Frans Kaashoek, SOSP ‘99
Also Scout
Paths the key to comm-centric OS
“Making Paths Explicit in the Scout Operating System”,
David Mosberger and Larry L. Peterson. OSDI ‘96.
[Figure 3: Example Router Graph]
More Interaction: CS262
Experiment w/ Eric Brewer
Merge OS & DBMS grad class, over a year
Eric/Joe, point/counterpoint
Some tie-ins were obvious:
memory mgmt, storage, scheduling, concurrency
Surprising: QP and networks go well side by
side
E.g. eddies and TCP Congestion Control
Both use back-pressure and simple Control Theory to
“learn” in an unpredictable dataflow environment
Eddies close to the n-armed bandit problem
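The TCP side of the analogy is easy to make concrete: additive-increase/multiplicative-decrease is the "simple Control Theory" rule that lets a sender learn a safe rate in an unpredictable environment. A toy sketch (not tied to any real network stack):

```python
def aimd(loss_events, cwnd=1.0, incr=1.0, decr=0.5):
    """Additive-increase/multiplicative-decrease, the control rule in
    TCP congestion avoidance: grow the window by `incr` each round
    with no loss, cut it by `decr` on a loss signal. `loss_events`
    is one boolean per round; returns the window trajectory."""
    trace = []
    for lost in loss_events:
        cwnd = cwnd * decr if lost else cwnd + incr
        trace.append(cwnd)
    return trace
```

An eddy's ticket policy plays the same game: probe, observe feedback, and adjust, with no model of the environment assumed in advance.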
Networking Overview for DB
People Like Me
Core function of protocols: data xfer
Data Manipulation (buffer, checksum, encryption,
xfer to/fr app space, presentation)
Transfer Control (flow/congestion ctl, detecting
xmission probs, acks, muxing, timestamps,
framing)
-- Clark & Tennenhouse, “Architectural Considerations for a
New Generation of Protocols”, SIGCOMM ‘90
Basic Internet assumption:
“a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)
(Sounds like query optimization!)
C & T’s Wacky Ideas
Thesis: nets are good at xfer control, not so
good at data manipulation
Some C&T wacky ideas for better data manipulation:
Xfer semantic units, not packets (ALF) — sounds like Data Modeling!
Auto-rewrite layers to flatten them (ILP)
Minimize cross-layer ordering constraints
Control delivery in parallel via packet content — sounds like Exchange!
Wacky New Ideas in QP
What if…
We had unbounded data producers and consumers
(“streams” … “continuous queries”)
We couldn’t know our producers’ behavior or contents??
(“federation” … “mediators”)
We couldn’t predict user behavior? (“control”)
We couldn’t predict behavior of components in the
dataflow? (“networked services”)
We had partial failure as a given? (oops, have we ignored
this?)
Yes … networking people have been here!
Remember Van Jacobson’s quote?
The Cosmic Convergence
[Diagram: DATABASE RESEARCH (Data Models, Query Opt, Data Scalability) and NETWORKING RESEARCH (Adaptivity, Federated Control, Geo-Scalability) converging on a shared middle ground: Adaptive Query Processing, Continuous Queries, Approximate/Interactive QP, and Sensor Databases on the DB side; Content-Based Routing, Router Toolkits, Content Addressable Networks, and Directed Diffusion on the networking side]
The Cosmic Convergence
[Same diagram, with Telegraph placed at the center of the convergence]
Road Map
How I got started on this
CONTROL project
Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow
research
New arenas:
Sensor networks
P2P networks
What’s in the Sweet Spot?
Scenarios with:
Structured Content
Volatility
Rich Queries
Clearly:
Long-running data analysis a la CONTROL
Continuous queries
Queries over Internet sources and services
Two emerging scenarios:
Sensor networks
P2P query processing
Telegraph: Engineering the
Sweet Spot
An adaptive dataflow system
Dataflow programming model
A la Volcano, CLICK: push and pull. “Fjords”, ICDE02
Extensible set of pipelining operators, including
relational ops, grouped filters (e.g. XFilter)
SQL parser for convenience (looking at XQuery)
Adaptivity operators
Eddies
+ Extensible rules for routing constraints, Competition
SteMs (state modules)
FLuX (Fault-tolerant Load-balancing eXchange)
Bounded and continuous:
Data sources
Queries
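The push-and-pull dataflow edges mentioned above can be sketched as a queue whose two ends are driven independently: a source pushes when data arrives, a consumer pulls when it wants a tuple, and a pull on an empty queue returns nothing rather than blocking. This is a much-simplified reading of the Fjords idea cited above, not the Telegraph code.

```python
from collections import deque

class Fjord:
    """Sketch of a Fjord-style edge between two operators: push-driven
    by the producer (e.g. a streaming sensor), pull-driven by the
    consumer, and non-blocking so a stalled source never stalls the
    rest of the plan."""
    def __init__(self):
        self.q = deque()

    def push(self, tup):
        # producer side: called whenever the source has data
        self.q.append(tup)

    def pull(self):
        # consumer side: returns None instead of blocking when empty
        return self.q.popleft() if self.q else None
```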
State Modules (SteMs)
Goal: Further adaptivity through competition
Multiple mirrored sources
Handle rate changes, failures, parallelism
Multiple alternate operators
Join = Routing + State
SteM operator manages tradeoffs
State Module: unifies caches, rendezvous buffers, join state
Competitive sources/operators share building/probing SteMs
Join algorithm hybridization!
[Diagrams: static dataflow vs. eddy vs. eddy + SteMs]
Vijayshankar Raman
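"Join = Routing + State" can be illustrated with a symmetric (doubly pipelined) hash join built from two SteM-like state modules: each arriving tuple is built into its own side's SteM and probed against the other. A sketch of the concept, not the Telegraph implementation; names are illustrative.

```python
from collections import defaultdict

class SteM:
    """One half of a symmetric hash join: shared state that tuples
    are built into and probed against, independent of any fixed
    join-operator pairing."""
    def __init__(self, key):
        self.key = key                       # join-key extractor
        self.table = defaultdict(list)

    def build(self, tup):
        self.table[self.key(tup)].append(tup)

    def probe(self, k):
        return self.table.get(k, [])

def stem_join(stream, stem_r, stem_s):
    """`stream` yields ('R', tuple) / ('S', tuple) in any interleaving;
    the join is just routing each tuple to build one SteM and probe
    the other."""
    out = []
    for side, tup in stream:
        mine, other = (stem_r, stem_s) if side == 'R' else (stem_s, stem_r)
        mine.build(tup)
        for match in other.probe(mine.key(tup)):
            out.append((tup, match) if side == 'R' else (match, tup))
    return out
```

Because the state lives in the SteMs rather than inside a join operator, competing operators (or mirrored sources) can share it, which is what enables the hybridization the slide mentions.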
FLuX: Routing Across Cluster
Fault Tolerance, Load Balancing
Continuous/long-running flows need high availability
Big flows need parallelism
Adaptive Load-Balancing req’d
FLuX operator: Exchange plus…
Adaptive flow partitioning (River)
Transient state replication & migration
Needs to be extensible to different ops:
Content-sensitivity
History-sensitivity
Dataflow semantics
RAID for SteMs
Optimize based on edge semantics
Networking tie-in again:
At-least-once delivery?
Exactly-once delivery?
In/out of order?
Migration policy: the ski rental analogy
Mehul Shah
Continuously Adaptive
Continuous Queries (CACQ)
Continuous Queries clearly need all this stuff! Address
adaptivity 1st.
4 Ideas in CACQ:
Use eddies to allow reordering of ops.
Explicit tuple lineage
Mark each tuple with per-op ready/done bits
Mark each tuple with per-query completed bits
Queries are data: join with Grouped Filter
But one eddy will serve for all queries
Much like XFilter, but for relational queries
Joins via SteMs, shared across all queries
Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared
algebraic expressions!
Delete a tuple from flow only if it matches no query
Next: F.T. CACQ via FLuXen
Sam Madden, Mehul Shah, Vijayshankar Raman
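The "queries are data" idea above can be sketched as a grouped filter: many continuous queries with predicates over one attribute share a single structure, so each tuple is matched against all queries at once and is dropped only if it matches none. This is an illustrative simplification (range predicates, linear scan), not the CACQ operator.

```python
class GroupedFilter:
    """Sketch of a CACQ-style grouped filter: queries are data.
    Each query registers a (lo, hi) range over one attribute; a
    tuple's value is matched against every registered query in a
    single pass, yielding its per-query 'completed' set."""
    def __init__(self):
        self.queries = []                    # list of (lo, hi, qid)

    def add_query(self, qid, lo, hi):
        self.queries.append((lo, hi, qid))

    def match(self, value):
        # linear scan for clarity; a real grouped filter would index
        # the predicate bounds rather than scan them
        return {qid for lo, hi, qid in self.queries if lo <= value <= hi}
```

A tuple whose `match()` set is empty can be deleted from the flow, which is exactly the "delete only if it matches no query" rule above.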
Road Map
How I got started on this
CONTROL project
Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow
research
New arenas:
Sensor networks
P2P networks
Sensor Nets
“Smart Dust” + TinyOS
Thousands of “motes”
Expensive communication
Power constraints
Query workload:
Aggregation & approximation
Queries and Continuous Queries
Simple example:
Aggregation query
Challenges:
Push the processing into the network
Deal with volatility & failure
CONTROL issues: data variance, user desires
Joint work with Ramesh Govindan, Sam Madden,
Wei Hong and David Culler (Intel Berkeley Lab)
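"Push the processing into the network" can be sketched for the aggregation example: each mote merges its children's partial (sum, count) states with its own reading and forwards one small record upward, instead of shipping every raw reading to the root. The tree topology and readings here are hypothetical.

```python
def in_network_avg(tree, readings, node):
    """In-network aggregation sketch for AVG over a sensor tree:
    each node returns a partial (sum, count) combining its own
    reading with its children's partials, so only one small record
    crosses each (expensive) radio link."""
    s, c = readings[node], 1
    for child in tree.get(node, []):
        cs, cc = in_network_avg(tree, readings, child)
        s, c = s + cs, c + cc
    return s, c
```

The root divides sum by count to produce the answer; the same merge-partials pattern covers COUNT, SUM, MIN, and MAX.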
P2P QP
Starting point: P2P as grassroots phenomenon
Outrageous filesharing volume (1.8G files in October 2001)
No business case to date
Challenge: scale DDBMS QP ideas to P2P
Motivate why
Pick the right parts of DBMS research to focus on
Storage: no! QP: yes.
Make it work:
Scalability well beyond our usual target
Admin constraints
Unknown data distributions, load
Heterogeneous comm/processing
Partial failure
Joint work with Scott Shenker, Ion Stoica, Matt
Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo
A Grassroots Example: TeleNap
Themes Throughout
Adaptivity
Requires clever system design
Interesting adaptive policy problems
The Exchange model: encapsulate in ops?
E.g. eddy routing, flux migration
Control Theory, Machine Learning
Encompasses another CS goal?
“No-knobs”, “Autonomic”, etc.
New performance regimes
Decent performance in the common case
Mean/Variance more important than MAX
Interactive Metrics
Time to completion often unimportant/irrelevant
More Themes
Set-valued thinking as albatross?
E.g. eddies vs. Kabra/DeWitt or Tukwila
E.g. SteMs vs. Materialized Views
E.g. CACQ vs. NiagaraCQ
Some clean theory here would be nice
Current routing correctness proofs are inelegant
Extensibility
Model/language of choice is not clear
SEQ? Relational? XQuery?
Extensible operators, edge semantics
[A whine about VLDB’s absurd “Specificity
Factor”]
Conclusions?
Too early for technical conclusions
Of this I’m sure:
The CS262 experiment is a success
Nets folks are coming our way
Our students are getting a bigger picture than before
I’m learning, finding new connections
May morph to OS/Nets, Nets/DB
Eventually rethink the systems software curriculum at
the undergraduate level too
Doing relevant work, eager to collaborate
DB community needs to branch out
Outbound: Better proselytizing in CS
Inbound: Need new ideas
Conclusions, cont.
Sabbatical is a good invention
Hasn’t even started, and I’m already grateful!