Transcript Document
PlanetLab: Evolution vs Intelligent Design in Global Network Infrastructure
Larry Peterson, Princeton University

Case for PlanetLab
[Figure: research maturity over time, rising from foundational research through simulation and research prototypes and small-scale testbeds toward a deployed Future Internet, with PlanetLab bridging the gap. This chasm is a major barrier to realizing the Future Internet.]

PlanetLab
• 637 machines spanning 302 sites and 35 countries
  – nodes within a LAN-hop of > 2M users
• Supports distributed virtualization
  – each of 350+ network services runs in its own slice

Slices
[Figure: three build-up views of slices spanning the node set.]

User Opt-in
[Figure: user opt-in path from a Client, through a NAT, to a Server.]

Per-Node View
[Figure: a Node Mgr VM and a Local Admin VM alongside service VMs (VM1, VM2, …, VMn), all running on the Virtual Machine Monitor (VMM).]

Long-Running Services
• Content Distribution
  – CoDeeN: Princeton
  – Coral: NYU
  – Cobweb: Cornell
• Internet Measurement
  – ScriptRoute: Washington, Maryland
• Anomaly Detection & Fault Diagnosis
  – PIER: Berkeley, Intel
  – PlanetSeer: Princeton
• DHT
  – Bamboo (OpenDHT): Berkeley, Intel
  – Chord (DHash): MIT

Services (cont)
• Routing
  – i3: Berkeley
  – Virtual ISP: Princeton
• DNS
  – CoDNS: Princeton
  – CoDoNs: Cornell
• Storage & Large File Transfer
  – LOCI: Tennessee
  – CoBlitz: Princeton
  – Shark: NYU
• Multicast
  – End System Multicast: CMU
  – Tmesh: Michigan

Usage Stats
• Slices: 350 - 425
• AS peers: 6000
• Users: 1028
• Bytes-per-day: 2 - 4 TB
• IP-flows-per-day: 190M
• Unique IP-addrs-per-day: 1M

Architectural Questions
• What is the PlanetLab architecture?
  – more a question of synthesis than cleverness
• Why is this the right architecture?
  – non-technical requirements
  – technical decisions that influenced adoption
• What is a system architecture anyway?
  – how does it accommodate change (evolution)?

Requirements
1) Global platform that supports both short-term experiments and long-running services.
  – services must be isolated from each other
    • performance isolation
    • name space isolation
  – multiple services must run concurrently
  Distributed Virtualization – each service runs in its own slice: a set of VMs (see the sketch after the requirements)

Requirements
2) It must be available now, even though no one knows for sure what “it” is.
  – deploy what we have today, and evolve over time
  – make the system as familiar as possible (e.g., Linux)
  Unbundled Management – independent mgmt services run in their own slices – they evolve independently; the best services survive – no single service gets to be “root”, but some services require additional privilege

Requirements
3) Must convince sites to host nodes running code written by unknown researchers.
  – protect the Internet from PlanetLab
  Chain of Responsibility – explicit notion of responsibility – trace network activity to the responsible party

Requirements
4) Sustaining growth depends on support for autonomy and decentralized control.
  – sites have the final say about the nodes they host
  – sites want to provide “private PlanetLabs”
  – regional autonomy is important
  Federation – universal agreement on a minimal core (narrow waist) – allow independent pieces to evolve independently – identify principals and the trust relationships among them

Requirements
5) Must scale to support many users with minimal resources available.
  – expect the under-provisioned state to be the norm
  – shortage of logical resources too (e.g., IP addresses)
  Decouple slice creation from resource allocation
  Overbook with recovery – support both guarantees and best effort – recover from wedged states under heavy load
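The slice abstraction behind Requirements 1 and 5 can be summarized in a small sketch: a slice is a named set of slivers (VMs), one per node, whose existence is decoupled from the resources later bound to each sliver. This is a minimal illustration, not PlanetLab code; the class and field names (Slice, Sliver, ResourceSpec, cpu_share, link_kbps, disk_gb) are hypothetical, with the default limits borrowed from the per-slice numbers on the VMM slide later in the talk.

```python
# Hypothetical sketch of "a slice is a set of VMs" with resource allocation
# decoupled from slice creation. Not PlanetLab's actual data model.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ResourceSpec:
    """Resources bound to a sliver after creation (guarantees or best effort)."""
    cpu_share: int = 1       # fair-share weight, not a reservation
    link_kbps: int = 1500    # average outbound rate limit (1.5 Mbps)
    disk_gb: int = 5         # per-slice disk quota


@dataclass
class Sliver:
    node: str                              # hostname of the hosting node
    rspec: Optional[ResourceSpec] = None   # None until resources are allocated


@dataclass
class Slice:
    name: str                              # e.g. "princeton_codeen"
    users: List[str] = field(default_factory=list)
    slivers: Dict[str, Sliver] = field(default_factory=dict)

    def add_node(self, node: str) -> None:
        """Slice creation: instantiate a best-effort sliver on a node."""
        self.slivers[node] = Sliver(node=node)

    def allocate(self, node: str, rspec: ResourceSpec) -> None:
        """Resource allocation, done later (e.g. by a brokerage service)."""
        self.slivers[node].rspec = rspec


if __name__ == "__main__":
    s = Slice(name="princeton_codeen", users=["alice@princeton.edu"])
    s.add_node("planetlab-1.cs.princeton.edu")
    s.allocate("planetlab-1.cs.princeton.edu", ResourceSpec(cpu_share=2))
```

Leaving the rspec optional mirrors the design point that a slice can exist, and even run best effort, before any ticket or broker binds resources to it.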
Tension Among Requirements
• Distributed Virtualization / Unbundled Management
  – isolation vs one slice managing another
• Federation / Chain of Responsibility
  – autonomy vs trusted authority
• Under-provisioned / Distributed Virtualization
  – efficient sharing vs isolation
• Other tensions
  – support users vs evolve the architecture
  – evolution vs clean slate

Synergy Among Requirements
• Unbundled Management
  – third-party management software
• Federation
  – independent evolution of components
  – support for autonomous control of resources

Architecture (1)
• Node Operating System
  – isolate slices
  – audit behavior
• PlanetLab Central (PLC)
  – remotely manage nodes
  – bootstrap service to instantiate and control slices
• Third-party Infrastructure Services
  – monitor slice/node health
  – discover available resources
  – create and configure a slice
  – resource allocation

Trust Relationships
[Figure: PLC as a trusted intermediary replacing N×N pairwise relationships between sites (Princeton, Berkeley, Washington, MIT, Brown, CMU, NYU, ETH, Harvard, HP Labs, Intel, NEC Labs, Purdue, UCSD, SICS, Cambridge, Cornell, …) and slices (princeton_codeen, nyu_d, cornell_beehive, att_mcash, cmu_esm, harvard_ice, hplabs_donutlab, idsl_psepr, irb_phi, paris6_landmarks, mit_dht, mcgill_card, huji_ender, arizona_stork, ucb_bamboo, ucsd_share, umd_scriptroute, …).]

Trust Relationships (cont)
[Figure: trust relationships 1-4 among the Node Owner, PLC, and Service Developer (User).]
1) PLC expresses trust in a user by issuing it credentials to access a slice
2) Users trust PLC to create slices on their behalf and to inspect credentials
3) Owner trusts PLC to vet users and map network activity to the right user
4) PLC trusts owner to keep nodes physically secure

Trust Relationships (cont)
[Figure: PLC split into a Management Authority (MA) and a Slice Authority (SA), adding relationships 5 and 6.]
1) PLC expresses trust in a user by issuing credentials to access a slice
2) Users trust PLC to create slices on their behalf and to inspect credentials
3) Owner trusts PLC to vet users and map network activity to the right user
4) PLC trusts owner to keep nodes physically secure
5) MA trusts SA to reliably map slices to users
6) SA trusts MA to provide working VMs

Architecture (2)
[Figure: service developers (users) request a slice from the Slice Authority and receive a new slice ID, learn about nodes, and access their slice; owners 1..N host the PlanetLab nodes; the Management Authority pushes software updates to nodes, collects auditing data from them, and identifies slice users to resolve abuse.]
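To make the arrows in the Architecture (2) figure concrete, here is a hedged sketch of the two authorities' roles. The class and method names (SliceAuthority.request_slice, ManagementAuthority.record_audit, resolve_abuse) are assumptions for illustration, not the real PLC interface; the actual slice-creation calls the talk uses (SliceCreate, SliceGetTicket, SliverCreate) appear in the Slice Creation figure near the end.

```python
# Toy model of the Architecture (2) interactions; names are hypothetical.
import itertools


class SliceAuthority:
    """Vouches for slices: maps slice IDs to the users responsible for them."""
    def __init__(self):
        self._next_id = itertools.count(1)
        self.slices = {}                 # slice_id -> {"name": ..., "users": [...]}

    def request_slice(self, name, users):
        """A service developer requests a slice; the SA returns a new slice ID."""
        slice_id = next(self._next_id)
        self.slices[slice_id] = {"name": name, "users": list(users)}
        return slice_id

    def identify_users(self, slice_id):
        """Used when resolving abuse to the responsible users."""
        return self.slices[slice_id]["users"]


class ManagementAuthority:
    """Manages node software and collects auditing data pushed by nodes."""
    def __init__(self, slice_authority):
        self.sa = slice_authority
        self.audit_log = []              # (node, slice_id, flow) records

    def record_audit(self, node, slice_id, flow):
        self.audit_log.append((node, slice_id, flow))

    def resolve_abuse(self, slice_id):
        return self.sa.identify_users(slice_id)


sa = SliceAuthority()
ma = ManagementAuthority(sa)
sid = sa.request_slice("princeton_codeen", ["alice@princeton.edu"])
ma.record_audit("planetlab-1.cs.princeton.edu", sid, "tcp -> 198.51.100.7:80")
print(ma.resolve_abuse(sid))             # -> ['alice@princeton.edu']
```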
Architecture (3)
[Figure: the Management Authority (MA) maintains a node database and supplies each node's Owner VM and NM + VMM software; the Slice Authority (SA) maintains a slice database and works with the slice creation service (SCS) to instantiate VMs for service developers on nodes hosted by their owners.]

Per-Node Mechanisms
[Figure: the Node Mgr (SliverMgr, Proper), an Owner VM, and service VMs (VM1, VM2, …, VMn) running pl_scs, pl_mom, PlanetFlow, and SliceStat on top of the Virtual Machine Monitor (VMM): a Linux kernel (Fedora Core) + Vservers (namespace isolation) + Schedulers (performance isolation) + VNET (network virtualization).]

VMM
• Linux
  – significant mind-share
• Vserver
  – scales to hundreds of VMs per node (12MB each)
• Scheduling
  – CPU
    • fair share per slice (guarantees possible)
  – link bandwidth
    • fair share per slice
    • average rate limit: 1.5Mbps (24-hour bucket size)
    • peak rate limit: set by each site (100Mbps default)
  – disk
    • 5GB quota per slice (limits run-away log files)
  – memory
    • no limit
    • pl_mom resets the biggest user at 90% utilization

VMM (cont)
• VNET (a minimal policy sketch follows this slide)
  – socket programs “just work”
    • including raw sockets
  – slices should be able to send only…
    • well-formed IP packets
    • to non-blacklisted hosts
  – slices should be able to receive only…
    • packets related to connections that they initiated (e.g., replies)
    • packets destined for bound ports (e.g., server requests)
  – essentially a switching firewall for sockets
    • leverages Linux's built-in connection tracking modules
  – also supports virtual devices
    • standard PF_PACKET behavior
    • used to connect to a “virtual ISP”
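The VNET send/receive policy above reduces to two checks, sketched below. This is a user-level toy model with assumed packet and table representations, not the in-kernel implementation, which hooks Linux's connection tracking.

```python
# Minimal sketch of the stated VNET policy; data structures are assumptions.
blacklist = {"192.0.2.1"}          # hosts slices may not contact
open_ports = {("vm1", 8080)}       # (slice, port) pairs bound by servers
initiated = set()                  # (slice, remote_addr, remote_port) connections


def may_send(slice_id, pkt):
    """Allow only well-formed IP packets sent to non-blacklisted hosts."""
    well_formed = pkt.get("src") is not None and pkt.get("dst") is not None
    if not well_formed or pkt["dst"] in blacklist:
        return False
    # remember outbound connections so replies can be accepted later
    initiated.add((slice_id, pkt["dst"], pkt["dport"]))
    return True


def may_receive(slice_id, pkt):
    """Allow replies to initiated connections, or traffic to bound ports."""
    is_reply = (slice_id, pkt["src"], pkt["sport"]) in initiated
    is_server_traffic = (slice_id, pkt["dport"]) in open_ports
    return is_reply or is_server_traffic


# Example: vm1 opens a connection, then receives the reply.
out = {"src": "10.0.0.2", "dst": "198.51.100.7", "dport": 80}
assert may_send("vm1", out)
back = {"src": "198.51.100.7", "sport": 80, "dst": "10.0.0.2", "dport": 4321}
assert may_receive("vm1", back)
```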
Node Manager
• SliverMgr
  – creates VMs and sets resource allocations
  – interacts with…
    • the bootstrap slice creation service (pl_scs)
    • third-party slice creation & brokerage services (using tickets)
• Proper: PRivileged OPERations
  – grants unprivileged slices access to privileged info
  – effectively “pokes holes” in the namespace isolation
  – examples
    • files: open, get/set flags
    • directories: mount/unmount
    • sockets: create/bind
    • processes: fork/wait/kill

Auditing & Monitoring
• PlanetFlow
  – logs every outbound IP flow on every node
    • accesses ulogd via Proper
    • retrieves packet headers, timestamps, context ids (batched)
  – used to audit traffic
  – aggregated and archived at PLC
• SliceStat
  – has access to kernel-level / system-wide information
    • accesses /proc via Proper
  – used by global monitoring services
  – used to performance-debug services

Infrastructure Services
• Brokerage Services
  – Sirius: Georgia
  – Bellagio: UCSD, Harvard, Intel
  – Tycoon: HP
• Environment Services
  – Stork: Arizona
  – AppMgr: MIT
• Monitoring/Discovery Services
  – CoMon: Princeton
  – PsEPR: Intel
  – SWORD: Berkeley
  – IrisLog: Intel

Evolution vs Intelligent Design
• Favor evolution over clean slate
• Favor design principles over a fixed architecture
• Specifically…
  – leverage existing software and interfaces
  – keep the VMM and control plane orthogonal
  – exploit virtualization
    • vertical: mgmt services run in slices
    • horizontal: stacks of VMs
  – give no one root (least privilege + level playing field)
  – support federation (decentralized control)

Other Lessons
• Inferior tracks lead to superior locomotives
• Empower the user: yum
• Build it and they (research papers) will come
• Overlays are not networks
• PlanetLab: We debug your network
• From universal connectivity to gated communities
• If you don't talk to your university's general counsel, you aren't doing network research
• Work fast, before anyone cares

Collaborators
• Andy Bavier, Marc Fiuczynski, Mark Huang, Scott Karlin, Aaron Klingaman, Martin Makowiecki, Reid Moran, Steve Muir, Stephen Soltesz, Mike Wawrzoniak
• David Culler (Berkeley), Tom Anderson (UW), Timothy Roscoe (Intel), Mic Bowman (Intel), John Hartman (Arizona), David Lowenthal (UGA), Vivek Pai (Princeton), David Parkes (Harvard), Amin Vahdat (UCSD), Rick McGeer (HP Labs)

Available CPU Capacity
[Figure: distribution of available CPU across 360 nodes (Pct of 360 Nodes vs. Pct of CPU Available), Feb 1-8, 2005, the week before the SIGCOMM deadline.]

Node Boot/Install
[Figure: message sequence among the Node, the Boot Manager, PLC (MA), and the Boot Server.]
1. Node boots from BootCD (Linux loaded)
2. Hardware initialized
3. Network config read from floppy
4. Node contacts PLC (MA)
5. PLC sends the boot manager
6. Node executes the boot manager
7. Node key read into memory from floppy
8. Boot manager invokes the Boot API
9. PLC verifies the node key and sends the current node state
10. State = “install”: run the installer
11. Node state updated via the Boot API
12. PLC verifies the node key and changes the state to “boot”
13. Chain-boot the node (no restart)
14. Node booted

Chain of Responsibility
• Join Request: PI submits Consortium paperwork and requests to join
• PI Activated: PLC verifies the PI, activates the account, enables the site (logged)
• User Activated: users create accounts with keys, PI activates the accounts (logged)
• Slice Created: PI creates a slice and assigns users to it (logged)
• Nodes Added to Slices: users add nodes to their slice (logged)
• Slice Traffic Logged: experiments generate traffic (logged by PlanetFlow)
• Traffic Logs Centrally Stored: PLC periodically pulls traffic logs from nodes
Network Activity → Slice → Responsible Users & PI

Slice Creation
[Figure: the PI calls SliceCreate( ) and SliceUsersAdd( ) at PLC (SA); the user calls SliceAttributeSet( ) and SliceGetTicket( ); PLC distributes the ticket to the slice creation service (pl_scs), which calls SliverCreate(rspec) at each Node Manager (NM) to instantiate a VM on the VMM.]

Brokerage Service
[Figure: rcap = PoolCreate(rspec) creates a resource pool at the Node Manager; SliceAttributeSet( ) and SliceGetTicket( ) at PLC (SA); the ticket is distributed to the brokerage service (Broker).]

Brokerage Service (cont)
[Figure: the user calls BuyResources( ) at the Broker; the Broker contacts the relevant nodes and calls PoolSplit(rcap, slice, rspec) to bind part of its pool to the user's slice.]
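Below is a hedged sketch of the brokerage flow in the last two figures, using the call names shown there (PoolCreate, PoolSplit, BuyResources). Everything beyond those names, including the argument shapes, the pool accounting, and the NodeManager/Broker classes, is an assumption for illustration rather than the real NM or broker API.

```python
# Toy model of the brokerage flow; call names from the slides, semantics assumed.
import uuid


class NodeManager:
    """Per-node manager holding resource pools and slivers (toy model)."""
    def __init__(self):
        self.pools = {}      # rcap -> unallocated rspec
        self.slivers = {}    # slice name -> rspec bound to it on this node

    def PoolCreate(self, rspec):
        rcap = str(uuid.uuid4())           # resource capability for the new pool
        self.pools[rcap] = dict(rspec)
        return rcap

    def PoolSplit(self, rcap, slice_name, rspec):
        pool = self.pools[rcap]
        for k, v in rspec.items():         # carve the request out of the pool
            if pool.get(k, 0) < v:
                raise ValueError("pool under-provisioned for %s" % k)
            pool[k] -= v
        self.slivers[slice_name] = dict(rspec)


class Broker:
    """Brokerage service (in the spirit of Sirius/Bellagio/Tycoon) holding pools."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.rcaps = {}                    # node -> rcap it holds there

    def acquire_pools(self, rspec):
        for node in self.nodes:
            self.rcaps[node] = node.PoolCreate(rspec)

    def BuyResources(self, slice_name, rspec):
        # the broker contacts the relevant nodes and splits its pools for the slice
        for node, rcap in self.rcaps.items():
            node.PoolSplit(rcap, slice_name, rspec)


nodes = [NodeManager(), NodeManager()]
broker = Broker(nodes)
broker.acquire_pools({"cpu_share": 10})
broker.BuyResources("princeton_test", {"cpu_share": 2})
print(nodes[0].slivers)   # -> {'princeton_test': {'cpu_share': 2}}
```

The point of the split between PoolCreate and PoolSplit mirrors Requirement 5: slice creation and resource allocation stay decoupled, so a broker can hold capacity and bind it to slices later without involving PLC on the critical path.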