(PHI) - KAIST

Download Report

Transcript (PHI) - KAIST


(Phi)
Timothy Roscoe, Joseph M. Hellerstein, Brent Chun, Nina Taft, Petros
Maniatis, Ryan Huebsch, Tyson Condie, Boon Thau Long
Intel Research at Berkeley and U.C. Berkeley
(much input from Tom Anderson, Vern Paxson, Larry Peterson, Scott
Shenker, Ion Stoica, and David Wetherall)
Intel Research
1
Lessons from PlanetLab…
 What have we learned from PlanetLab?
 We understand how to build robust large-scale systems
 E.g. structured overlay networks that scale to the planet
 What have we learned about the state of the
Internet?
 It’s brittle.
 It’s unpredictable.
 It’s doesn’t know what’s happening.
 It’s afraid.
Intel Research
2
The brittleness of the
Internet
 Security systems are afraid of the unknown.
 Everything new is unknown.
 Everything new is a threat.
 Better to shut it down now.
 Users really don’t comprehend the problems (and
why should they?)
 Not exactly easy to understand
 Very little information available
Intel Research
3
The brittleness of the
Internet
 Performance is unpredictable
 Failures, bottlenecks, congestion, misconfigurations, etc.
 It appears overlays can do better
 Provided they can measure the network.
 This shouldn’t work, but it does.
 IP protocols assume no information is available
about current network state.
Intel Research
4
Research trends and
directions
 Extensive network measurement / modelling
 Distributed security solutions
 Distributed performance diagnosis
 Machine learning across networks
 Measurement-based overlays
 Better Internet protocols
 Network visualizations
Intel Research
5
What’s missing?
Applications and services
User awareness
?
Measurement, monitoring, logging, etc.
of the real Internet
Intel Research
6
An “Information Plane” for the
Internet
 Continuous queries over distributed network state,
available to all end systems
 Integrate data from:
 Backbone monitoring
 Router configuration
 Network state databases (e.g. RouteViews)
 Security systems (Firewalls, DShield, Autograph, etc.)
 End-system monitoring
Intel Research
7
The big picture
sensor
INTERNET
Types of Clients
End users
End applications
Overlays
Types of sensors
sensor
sensor
End-systems
Backbone monitors
Routers
Network databases
Firewall logs
Intel Research
8
The big picture
query
plan
query
plan
sensor
disseminate
query
sensor
sensor
sensor
sensor
Intel Research
9
The big picture
query
execution
query
execution
Answer
sensor
Query
results
sensor
sensor
sensor
sensor
Intel Research
10
Implications of success
 Short-term: Enable & Connect measurement & security
researchers
 E.g. “Live DShield”, top 10 IP address result from Barford et.al
 Promote user-awareness through downloadable tools
 Medium-term: Provide global network knowledge for
planetary-scale applications & overlays
 E.g. Resource discovery on PlanetLab, OpenDHT could exploit NW
link information
 Long-term: Kick off a new generation of Network-Aware
Internet Protocols
 E.g. Host-based source routing solutions
Intel Research
11
Phi goals
 Create the missing piece of the information plane by building
a scalable, distributed dataflow engine for processing
continuous queries in-network
 Data tuples are
 Routed between nodes along a dataflow graph
 Processed at nodes (filtering, aggregation, data reduction, correlation,
result dissemination)
Declarative
Queries
Abstract
Dataflow
(Query Plan)
Physical Dataflow
in Overlay Network
Physical Network
Intel Research
12
The hard problems
 Scale: Millions of sources, sinks, queries
 Linear scaling on a n3 problem : need to factor out n2 redundant
communication & computation
 Fidelity & Security
 Bad inputs: data poisoning, perturbed computations
 Bad outputs: launchpads, vulnerability detection
 Efficiently embedding analysis algorithms in network
topologies
 Data must be combined (hence moved around the network) according
to the distributed analysis algorithm
Intel Research
13
The rest of the talk
 Where we are today
 PIER: distributed relational query processor
 Single query, many sources, many sinks
 Deployed on PlanetLab for the last 12 months
 Where we intend to go
 P2: full dataflow engine with multiquery scaling
 Topological Fault Tolerance
 Develop embeddings of distributed analysis algorithms
Intel Research
14
Key technology: Structured
overlay networks (DHTs)
• E.g. Chord, Pastry,
Tapestry, CAN,
Kademlia...
• Flat, sparse ID space
(e.g. 160-bit identifiers)
• Routing in log(n) hops
routing to the owner of
any key
• Based on “interesting”
routing graphs
Intel Research
15
What can DHTs do?
• Content-Based Routing
– i.e. send a message to a
key
– Equivalent to hashing a
key to a node
• Storage
– Storing values in the
network under a key
• Tree construction
– Formed by routing to a
key from all nodes
0
15
1
14
2
13
3
12
4
11
5
10
6
9
8
7
Intel Research
16
What can DHTs do?
• Content-Based Routing
– i.e. send a message to a
key
– Equivalent to hashing a
key to a node
• Storage
– Storing values in the
network under a key
• Tree construction
– Formed by routing to a
key from all nodes
0
15
1
14
2
13
3
12
4
11
5
10
6
9
8
7
Intel Research
17
Using DHTs in Phi
 Query Dissemination (trees)
 Hierarchical Aggregation (trees and storage)
 Indexing (routing and storage)
 Range Indexing Substrate (routing and storage)
 Hash-partitioned parallelism (routing)
 Hash tables for group-by, join (storage)
Intel Research
18
Bamboo: our DHT
(Sean Rhea)
 Pastry-style routing
 Epidemic propagation of leaf sets, routing tables
 Recursive routing
 Adaptive timeouts based on continuous
measurement
 Highly robust under churn
 Tested to ~1000 nodes
 PlanetLab, ModelNet
Intel Research
19
PIER: a relational query
engine
 Data is tuples in named tables
 Tables exist on nodes
 Relational operators:
 Selection
 Projection
 Join (correlate, intersect, match)
 Aggregation (summarize, compress, group by)
 Also has recursive queries
 Can query topological structures
Intel Research
20
PIER architecture
Declarative
Queries
Query Plan
Overlay Network
Physical Network
Network
Monitoring
Other User
Apps
Applications
Query
Optimizer
Catalog
Manager
Core
Relational
Execution
Engine
PIER
DHT
Wrapper
Storage
Manager
IP
Network
Overlay
Routing
DHT
Network
Intel Research
21
Experience so far
• PIER has run on
PlanetLab for about
a year
• Querying PlanetLab
sensors, in particular
Snort events
Intel Research
22
Experience so far
 Use of DHT for query processing by-and-large
works
 Need story for NATs, non-transitive connectivity
 Node heterogeneity
 Multiresolution emulation is essential
 Simulation, emulation (ModelNet), deployment
(PlanetLab)
 Simple results are quite compelling
 E.g. top 10 attackers demo for IDF
Intel Research
23
P2
 Build on PIER techniques
 Reimplementation in C++
 Extend beyond relational operators
 Synopses/sketches, junction trees, Bayes nets, PCA,..
 Address multiquery optimization (2 factors of n)
 Investigate using the overlay for data fidelity
 Codify communication and computation for a variety
of algorithms
Intel Research
24
Multiquery optimization

Granular lineage for data inputs and intermediate data products
 Telegraph: Tuple lineage bitmaps (operators & queries)
 Scaling via cluster analysis: bits name sets of queries/operators
0
15

14
Embedded in the network
 Multi-operator query  mesh of multiple trees is formed
1
2
13
3
12
4
 Optimizations in routing & replicating intermediate results
11

5
Scaling result dissemination
10
 Multicast from within the MQO mesh
6
9
8
7
 A many-source/many-sink multicast problem
 Tie-ins with MQO: Can choose the multicast source points as part of query optimization
Intel Research
25
Ephemeral overlay
topologies
 DHTs emulate InterConnect Networks
 These have deep algebraic structure
 Based on group-theoretic graph constructs
 Rich families of such graphs with different properties
 We can exploit the structure (i.e. constraints) of the overlay
 To embed complex computations with efficient communication
 To reason about the “influence” of malicious nodes in the network
 We could choose ephemeral topologies to suit specific analysis algorithms
Intel Research
26
Topological Fault Tolerance

Fidelity and Security
influence: 8 nodes

Diversifying Influence
 Reundant computation (a la process pairs)
applied in an adversarial environment
 Structured overlay topologies admit analysis
of spheres of influence

Two Dimensions to Diversity
 In Space: Multiple trees, different roots
 In Time: Reassign node IDs to change spheres
of influence
influence: 1 node
Intel Research
27
Design Patterns for NetworkEmbedded Data Analysis
 Taxonomize and abstract comm patterns for in-network analyses
binomial tree
 We already understand how some of these map to DHTs
0
15
 Up-tree aggregation (AVG, SUM, etc.)
14
 Up-a-special-tree aggregation (Haar Wavelets)
 Arbitrary dissemination (e.g. MIN, MAX, Gibbons-style
duplicate-insensitive sketching)
1
2
13
3
12
 Lessons from sensornet arena (topology-oblivious)
 First cut taxonomy of aggrs (TAG)
4
11
5
10
 Junction Trees for distributed inference: Up-then-down a special tree
6
9
8
7
 What all this means
 Algebraic properties of comm patterns dictate an extensibility API
 Expose only enough of alg logic to achieve optimization, code reuse, resource control
 Efficient comm patterns can drive research into new analysis techniques
Intel Research
28
Collaborations
• Network Measurement
Community
– End-host packet traces (KAIST VMS,
NETI@home, DIMES)
– Firewall log repositories (Snort, Bro,
DShield, Domino)
– Backbone monitors and repositories
(CAIDA, CoMo)
– Network tomography (RouteViews,
NetTelescope)
• Distributed Algorithms
Community
– Summarization / data reduction
(IRP, Bell Labs)
– Inference / anomaly detection
(CMU, UCB, IR)
– Signature detection (IRP,
EarlyBird)
– Joins / correlations (UCB, ICSI)
Value proposition: reusable backplane for
–
–
–
–
Real-time data summarization & transport
Data validation (against other sources)
Correlation with other data
Algorithm design and deployment
Intel Research
29
The Plan (1)
 Build a distributed peer-to-peer dataflow engine
 Define protocols:
 Tuple transfer protocol
 Dataflow signalling protocol
 Instantiate the “right” overlay
 Address multiquery optimization
 Rich aggregations/summarization and joins/correlations
 Explore topological diversity in time and space
 Identify efficient, realizable families
 Establish feasible timescales for topology construction
 Apply to both topological FT and net-embedded computations
Intel Research
30
The Plan (2)
 Deploy an initial information plane, starting on PlanetLab and building out
 Multiple classes of data sources:
 End-system monitoring (e.g. Neti@home)
 Link monitors (e.g. CoMo)
 Network Telescopes (dark address space)
 Databases / archives (e.g. Routeviews)
 Build example applications ourselves
 Implement example analysis operators: wavelet, PCA, etc.
 Enable others to more easily build applications
 Client libraries
 Query handholding
Intel Research
31
Many thanks!
[email protected]
Intel Research
32