Surviving Large Scale Internet Failures
Dr. Krishna Kant, Intel Corp.
7/16/2015 K. Kant, Surviving Large Scale Internet Failures, DSN 2007 Tutorial

The Problem
• The Internet has two critical elements
  – Routing (inter- & intra-domain)
  – Name resolution
• How robust are they against large-scale failures/attacks?
• How do we improve them?

Outline
1. Overview – Internet infrastructure elements & large-scale failures
2. Dealing with routing failures
   1. Routing algorithms & their properties
   2. Improving BGP convergence
   3. Other performance metrics
3. Dealing with name-resolution failures
   1. Name-resolution preliminaries
   2. DNS vulnerabilities & solutions
4. Conclusions and open issues

Internet Routing
• Not a homogeneous network
  – A network of autonomous systems (ASes)
  – Each AS under the control of an ISP
  – Large variation in AS sizes – typically heavy-tailed
• Inter-AS routing
  – Border Gateway Protocol (BGP), a path-vector algorithm
  – Serious scalability/recovery issues
• Intra-AS routing
  – Several algorithms; usually work fine
  – Central control, smaller network, …
  [Figure: inter-domain vs. intra-domain routers]

Internet Name Resolution
• Domain Name System (DNS)
  – May add significant delays
  – Replication of TLDs & others resists attacks, but extensive caching makes attacks easier!
  – Not designed for security – can be easily attacked
• DNS security
  – Crypto techniques can stop many attacks, but substantial overhead & other challenges
• Other solutions
  – Peer-to-peer based, but no solution is entirely adequate
Large Scale Failures
• Characteristics:
  – Affects a significant % of infrastructure in some region
    • Routers, links, name servers
  – Generally non-uniformly distributed, e.g., confined to a geographical area
• Why study large-scale failures?
  – Several instances of moderate-sized failures already
    • Larger-scale failures are only a matter of time
  – Potentially different behavior
    • Secondary failures due to large recovery traffic, substantial load imbalance, …

Routing Failure Causes
• Large-area router/link damage (e.g., earthquake)
• Large-scale failure due to a buggy SW update
• High-BW cable cuts
• Router configuration errors
  – Aggregation of large un-owned IP blocks
    • Happens when prefixes are aggregated for efficiency
  – Incorrect policy settings resulting in large-scale delivery failures
• Network-wide congestion (DoS attack)
• Malicious route advertisements via worms

Name Resolution Failure Causes
• Records can be easily altered to contain fake (name, IP) info
  – “Poisoning” of records doesn’t even require compromising the server!
  – Extensive caching → more points of entry
• Poisoning of TLD records (or other large chunks of the name space)
  – Disables access to a huge number of sites
  – Example: March 2005 .com attack
• Poisoning is a perfect way to gain control of sensitive information on a large scale

Major Infrastructure Failure Events
• Blackout – widespread power outage (NY & Italy, 2003)
• Hurricane – widespread damage (Katrina)
• Earthquake – undersea cable damage (Taiwan, Dec 2006)
• Infrastructure-induced (e.g., May 2007, Japan)
• Many other potential causes
Taiwan Earthquake (Dec 2006)
• Issues:
  – Global traffic passes through a small number of seismically active choke points
    • Luzon Strait, Malacca Strait, south coast of Japan
  – Satellite & overland cables don’t have enough capacity to provide backup
  – Several countries depend on only 1–2 distinct landing points
• Outlook
  – Economics makes change unlikely
  – May be exploited by collusion of pirates + terrorists
  – Will perhaps see a repeat performance!
• Reference: http://www.pinr.com/report.php?ac=view_report&report_id=602&language_id=1

Hurricane Katrina (Aug 2005)
• Major local outages; no major regional cable routes through the worst-affected areas
• Outages persisted for weeks & months; notable aftereffects in FL (significant outages 4 days later!)
• Reference: http://www.renesys.com/tech/presentations/pdf/Renesys-Katrina-Report9sep2005.pdf

NY Power Outage (Aug 2003)
• Number of concurrent network outages vs. time:
  – Large ASes suffered less than smaller ones
  – Behavior very similar to the Italian power outage of Sept 2003
• A significant number of ASes had all their routers down for >4 hours

Slammer Worm (Jan 2003)
• Scanner worm started with a buffer overflow of MS SQL
  – Very rapid replication; huge congestion buildup in 10 minutes
  – Korea falls off the net, 5/13 DNS root servers fail, failed ATMs, …
• High BGP activity to find working routes
• Reference: http://www.cs.ucsd.edu/~savage/papers/IEEESP03.pdf

Infrastructure Induced Failures
• En-masse use of backup routes by 4000 Cisco routers in May 2007 (Japan)
  – Routing table rewrites caused 7-hr downtime in NE Japan
  – Reference: http://www.networkworld.com/news/2007/051607-cisco-routers-major-outage-japan.html
• Akamai CDN failure – June 2004
  – Probably widespread failures in Akamai’s DNS
  – Reference: http://www.landfield.com/isn/mail-archive/2004/Jun/0064.html
• Worldcom router misconfiguration – Oct 2002
  – Misconfigured eBGP router flooded internal routers with routes
  – Reference: http://www.isoc-chicago.org/internetoutage.pdf

Routing Algorithms
• Basic methods
  – Distance-vector (DV)
  – Link-state (LS)
  – Path-vector (PV)
• DV examples
  – RIP (Routing Information Protocol)
  – IGRP (Interior Gateway Routing Protocol)
• LS examples
  – OSPF (Open Shortest Path First)
  – IS-IS (Intermediate System to Intermediate System)
• PV examples
  – BGP (Border Gateway Protocol)
  – There are intra-domain (iBGP) & inter-domain (eBGP) versions

Distance Vector (DV) Protocols
• Each node advertises its path costs to its neighbors
• Very simple, but suffers from the “count to infinity” problem
  – A node with a broken link may receive the old cost back & use it to replace the broken path!
  – Several variants exist to fix this
• Difficult to use policies

  Routing Table for A
  Dest  Next  Cost
  B     B     1
  C     C     1
  D     B     4
  E     B     2
  F     C     3
  [Figure: example network with nodes A–F]
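The distance-vector update rule just described can be sketched as below. This is an illustrative sketch, not any specific RIP/IGRP implementation; the function and table layout are our assumptions, while the cost cap of 16 is RIP's conventional "infinity" that bounds count-to-infinity.

```python
INF = 16  # RIP treats 16 as "infinity" to bound count-to-infinity

def dv_update(table, neighbor, neighbor_table, link_cost):
    """Merge a neighbor's advertised distance vector into our table.

    table: dict dest -> (next_hop, cost)
    neighbor_table: dict dest -> cost, as advertised by `neighbor`
    link_cost: cost of our link to `neighbor`
    Returns True if any route changed.
    """
    changed = False
    for dest, adv_cost in neighbor_table.items():
        new_cost = min(link_cost + adv_cost, INF)
        cur = table.get(dest)
        # Adopt the route if it is cheaper, or if our current route already
        # goes through this neighbor (its advertisement overrides ours --
        # this is exactly how a stale, higher cost can creep back in).
        if cur is None or new_cost < cur[1] or cur[0] == neighbor:
            if cur != (neighbor, new_cost):
                table[dest] = (neighbor, new_cost)
                changed = True
    return changed
```

Note how the "current route goes through this neighbor" branch is what lets a broken path's old cost propagate back, which is the count-to-infinity behavior the slide warns about.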
Link State (LS) Protocols
• Each node keeps the complete adjacency/cost matrix & computes shortest paths locally
  – Difficult in a large network
• Any failure is propagated via flooding
  – Expensive in a large network
• Loop-free & can use policies easily

  Link-state database (example)
  Src  Dest  Link  Cost
  A    B     1     1
  A    C     2     1
  B    A     2     1
  B    D     3     2
  C    A     2     1
  C    D     4     1
  C    E     5     3
  [Figure: example network A–E with numbered links]

Path Vector Protocols
• Each node is initialized with some paths for each destination
• Active paths updated much like in DV
  – Explicitly withdraw failed paths (& advertise the next best)
• Filtering on incoming/outgoing paths; path-selection policies

  Paths A to D:
  • Via B: B/E/F/D, cost 3
  • Via C: C/E/F/D, cost 4
  [Figure: example network A–F with one link of cost 2]

Intra-domain Routing under Failures
• Intra-domain routing can usually limp back to normal rather quickly
  – Single domain of control
    • High visibility, common management network, etc.
  – Most ASes are small
  – Very simple policies only
• Routing algorithms
  – Distance-vector (RIP, IGRP): simple; enhancements prevent most count-to-infinity problems
  – Link-state (OSPF): flooding handles failures quickly
  – Path-vector (iBGP): behavior similar to eBGP

Inter-domain Routing
• BGP: default inter-AS protocol (RFC 1771)
• Path-vector protocol, runs on TCP
  – Scalable, “rich” policy settings
• But prone to long “convergence delays”
  – High packet loss & delay during convergence
  [Figure: AS1–AS3 with border & internal routers, iBGP/eBGP sessions, IGP, announcements]
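The loop-avoidance property of a path-vector protocol can be sketched as follows. This is a deliberate simplification: real BGP applies local-pref, MED, and other policy steps before path length, and the function names here are illustrative.

```python
def accept_path(my_as, as_path):
    """A path-vector speaker rejects any advertised path that already
    contains its own AS number -- this is what keeps BGP loop-free."""
    return my_as not in as_path

def best_path(my_as, candidates):
    """Among loop-free candidate AS paths, prefer the shortest one
    (a stand-in for BGP's full decision process)."""
    valid = [p for p in candidates if accept_path(my_as, p)]
    return min(valid, key=len) if valid else None
```

Because loop detection needs only the advertised path itself, no global coordination or link-state flooding is required, which is why the check scales to inter-domain use.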
BGP Routing Table
• Prefix: origin address for the destination & mask (e.g., 207.8.128.0/17)
• Next hop: neighbor that announced the route

  Dest prefix      Next hop         Cost
  204.70.0.0/15    207.240.10.143   10
  192.67.95.0/24   192.67.95.57     5
  140.222.0.0/16   207.240.10.120   2

• One active route; others kept as backup
• Route “attributes” – some may be conveyed outside
  – AS path: used for loop avoidance
  – MED (multi-exit discriminator): preferred incoming path
  – Local pref: used for local path selection

BGP Messages
• Message types
  – Open (establish TCP conn), notification, update, keepalive
• Update
  – Withdraws old routes and/or advertises new ones
  – Only active routes can be advertised
  – May need to also advertise a sub-prefix (e.g., 207.8.240.0/24, which is contained in 207.8.128.0/17)

  Update message layout:
  – Withdrawn route lengths (2 octets)
  – Withdrawn routes (variable length)
  – Length of all path attributes (2 octets)
  – Advertised path attributes (variable length)
  – Reachability information (variable length)

Routing Process
  [Figure: routes received from peers → input policy engine (accept, deny, set preferences) → BGP decision process → BGP routing table → IP routing table → output policy engine (forward or not, set MEDs) → routes sent to peers]
• Input & output policy engines
  – Filter routes (by attributes, prefix, etc.)
  – Manipulate attributes (e.g., local pref, MED, etc.)

BGP Recovery
• BGP convergence delay
  – Time for ALL routes to stabilize
  – 4 different convergence times defined!
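Lookups against a BGP-style table like the one shown earlier use longest-prefix match. A minimal sketch, with the entries copied from the example table (cost field omitted; the dict layout and function name are our assumptions):

```python
import ipaddress

# Entries from the example BGP routing table: prefix -> next hop.
routes = {
    "204.70.0.0/15":  "207.240.10.143",
    "192.67.95.0/24": "192.67.95.57",
    "140.222.0.0/16": "207.240.10.120",
}

def lookup(dest):
    """Return the next hop for the most specific (longest) matching
    prefix, or None if no prefix covers the destination."""
    addr = ipaddress.ip_address(dest)
    best = None
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None
```

A real router uses a trie or TCAM rather than a linear scan, but the selection rule (most specific prefix wins) is the same.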
• BGP parameters
  – Minimum Route Advertisement Interval (MRAI)
  – Path cost, path priority, input filter, output filter, …
• MRAI specifics
  – Applies only to advertisements, not withdrawals
  – Intended: per destination; implemented: per peer
  – Damps out cycles of withdrawals & advertisements
  – Convergence delay vs. MRAI: a V-shaped curve

  Convergence conditions:
  Symbol  Condition
  Tup     A down node/link restored
  Tshort  A shorter path advertised
  Tlong   Switchover to longer paths due to a link failure
  Tdown   All paths to a failed node withdrawn

  [Figure: convergence delay vs. MRAI, V-shaped curve]

Impact of BGP Recovery
• Long recovery times
  – Measurements for isolated failures:
    • >3 minutes for 30% of isolated failures
    • >15 minutes for 10% of cases
  – Even larger for large-scale failures
• Consequences
  – Connection attempts over invalid routes will fail
  – Long delays & compromised QoS due to heavy packet loss
  – Packet loss: 30× increase; delay: 4× increase
  [Figures: cumulative percentage of events vs. seconds until convergence (Tup, Tshort, Tlong, Tdown); percent packet loss in one-minute bins before and after the fault. Graphs taken from ref #2, Labovitz et al.]

BGP Illustration (1)
• Example network
  [Figure: nodes H, E, G, F, I, D, A, B, C; all link costs = 1 except links with costs 2, 3, and 10 as shown]
• Notation for best path: P_SD = (N, cost) [X]
  – S, D: source & destination nodes
  – N: neighbor of S through which the path goes
  – X: actual path (for illustration only)
• Sample starting paths to C:
  – P_BC = (D,3) [BDAC], P_DC = (A,2) [DAC], P_FC = (E,3) [FEAC], P_IC = (H,5) [IHGAC]
  – Paths shown using arrows (all share segment AC)
• Failure of A
  – BGP does not attempt to diagnose the problem or broadcast failure events
BGP Illustration (2)
• NOTE: affected node names shown in blue in the figures, rest in white
• A’s neighbors install new paths avoiding A
  – P_DC = (B,5) [DBFEAC], P_EC = (F,5) [EFBDAC], P_GC = (H,6) [GHIBDAC]
• D advertises P_DC = [DBFEAC] to B
  – Current P_BC is via D → B must pick a path not via D
  – B installs P_BC = (F,4) [BFEAC] & advertises it to F & I (first time)
  – Note: green indicates B’s first advertisement

BGP Illustration (3)
• E advertises P_EC = [EFBDAC] to F
  – Current P_FC is via E
  – F installs P_FC = (B,4) [FBDAC] & advertises to E & B
• G advertises P_GC = [GHIBDAC] to H
  – Current P_HC is via G
  – H installs P_HC = (I,5) [HIBDAC] & advertises to I

BGP Illustration (4)
• B’s adv [BFEAC] reaches F & I
  – P_FC = (B,4) [FBDAC] is thru B → F withdraws P_FC & has no path to C!
  – P_IC = (H,5) [IHGAC] is shorter → I retains it
• F’s adv [FBDAC] reaches B: P_BC = (F,4) [BFEAC] is thru F
  – B installs P_BC = (I,6) [BIHGAC] and advertises to D, F & I
• Note: green text: B’s first adv; grey text: B’s subsequent adv (disallowed by MRAI)

BGP Illustration (5a)
• H’s adv [HIBDAC] reaches I
  – P_IC = (H,5) [IHGAC] is thru H → I installs P_IC = (B,6) [IBDAC] & advertises to B & H
• B’s adv [BIHGAC] reaches D & F
  – D updates P_DC = (B,8) [DBIHGAC] (just a local update)
  – F updates P_FC = (B,8) [FBIHGAC] & advertises to E
• With MRAI
  – D & F have the wrong (lower) cost metric, but will still follow the same path thru B
BGP Illustration (5b)
• B’s adv [BIHGAC] reaches I
  – P_IC = (B,6) [IBDAC] is thru B → I withdraws P_IC & has no path to C!
• I’s adv [IBDAC] reaches B & H
  – H changes its path to [HIBDAC]
  – B’s path is thru I, so B installs (C,10) & advertises to its neighbors D, F & I
• With MRAI
  – I will continue to use the non-working path IBDAC; same as having no path

BGP Illustration (5c)
• F’s update reaches E
  – E updates its path locally
• H’s withdrawal of [HIBDAC] reaches G (& also I)
  – G withdraws the path GHIBDAC & has no path to C!
• I’s withdrawal of [IBDAC] reaches H (& also B)
  – H withdraws the path HIBDAC & has no path to C!
• With MRAI
  – Non-working paths stay at E, H & G

BGP Illustration (6) – No MRAI
• B’s adv [C] reaches D, F & I (in some order)
  – D updates its path cost (B,11)
  – F updates its path & cost (B,11) & advertises P_FC to E
  – I updates its path cost (B,13) & advertises P_IC to H
• Final updates
  – F’s update [FBC] reaches E, which updates its path locally
  – I’s adv [IBC] reaches H
    • H updates its path & cost (I,14) [HIBC] & advertises P_HC to G
  – G does a local update

BGP Illustration (5′) – w/ MRAI
• H’s adv [HIBDAC] reaches I
  – P_IC = (H,5) [IHGAC] is thru H → I installs P_IC = (B,6) [IBDAC] & advertises to B & H
• I’s adv [IBDAC] reaches B & H
  – H changes its path to [HIBDAC]
  – B’s path is thru I, so B installs (C,10)
  – When MRAI expires, B advertises to its neighbors D, F & I
• Note: if MRAI is large, path recovery gets delayed
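The per-peer MRAI rate limiting that drives these illustrations (advertisements are held back, withdrawals are exempt) can be sketched as below; the class name and the use of a caller-supplied clock are illustrative assumptions.

```python
class MraiLimiter:
    """Per-peer Minimum Route Advertisement Interval.

    As the slides note, MRAI as implemented applies per peer and only
    to advertisements -- withdrawals are sent immediately.
    """

    def __init__(self, mrai_seconds):
        self.mrai = mrai_seconds
        self.last_adv = {}  # peer -> time of the last advertisement sent

    def may_advertise(self, peer, now):
        """Return True (and record the send time) if an advertisement to
        `peer` is allowed at logical time `now`; otherwise it must wait."""
        last = self.last_adv.get(peer)
        if last is None or now - last >= self.mrai:
            self.last_adv[peer] = now
            return True
        return False
```

With a large MRAI, the advertisement in step (5′) waits out the full interval before reaching D, F & I, which is exactly the delayed recovery the note warns about.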
BGP Illustration (6′) – w/ MRAI
• B’s adv [C] reaches D, F & I (in some order)
  – D updates its path cost (B,11)
  – F updates its path & cost (B,11) & advertises P_FC to E
  – I installs the updated path [IBC] and advertises it to H
• Final updates: same as for (6)
• With vs. without MRAI:
  – MRAI avoids some unnecessary path updates (less router load)

BGP: Known Analytic Results
• Lots of work for isolated failures
  – Labovitz [1]:
    • Convergence delay bound for full-mesh networks: O(n³) for the average case, O(n!) for the worst case
  – Labovitz [2], Obradovic [3], Pei [8]:
    • Assuming unit cost per hop, convergence delay ∝ length of the longest path involved
  – Griffin and Premore [4]:
    • V-shaped curve of convergence delay w.r.t. MRAI
    • Number of messages decreases with MRAI at a decreasing rate
• Large-scale failures: even harder!

Evaluation of Large Scale Failures
• Evaluation methods
  – Primarily simulation; analysis is intractable
• BGP simulation tools
  – Several available, but simulation expense is the key issue!
  – SSFNet: scalable, but max 240 nodes on a 32-bit machine
• SSFNet default parameter settings
  – MRAI, but jittered by 25% to avoid synchronization
  – OSPFv2 used as the intra-domain protocol

Topology Modeling
• Topology generation: BRITE
  – Enhanced to generate arbitrary degree distributions
    • Heavy-tailed, based on actual measurements
    • Approx. 70% low- & 30% high-degree nodes
  – Mostly used 1 router/AS → easier to see trends
• Failure topology: geographical placement
  – Emulated by placing all AS routers and ASes on a 1000×1000 grid
  – The “area” of an AS ∝ number of routers in the AS
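The geographic failure placement above can be sketched as follows. The 1000×1000 grid is from the slide; the uniform placement, circular failure region, and function names are our illustrative assumptions about the setup.

```python
import random

def place_routers(n, grid=1000, seed=0):
    """Scatter n routers uniformly at random on a grid x grid plane
    (a stand-in for the tutorial's 1000x1000 placement)."""
    rng = random.Random(seed)
    return [(rng.uniform(0, grid), rng.uniform(0, grid)) for _ in range(n)]

def fail_region(routers, center, radius):
    """Return the indices of routers inside a circular failed region,
    modeling a geographically confined large-scale failure."""
    cx, cy = center
    return [i for i, (x, y) in enumerate(routers)
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2]
```

Sweeping the radius (and hence the % of failed routers) is what produces the failure-extent axis in the plots that follow.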
Convergence Delay vs. Failure Extent
• Initial rapid increase, then flattens out
• Delays & rate of increase both go up with network size → large failures can be a problem!
  [Figure: convergence delay (s) vs. % of failed routers for 60-, 120-, and 180-AS networks]

Delay & Message Traffic vs. MRAI
• Small networks in simulation
  – Optimal MRAI for isolated failures is small (0.375 s)
  – Chose a few larger values
• Main observation
  – Larger failure → larger MRAI more effective
  [Figure: improvement in convergence delay & number of messages vs. % of failed routers, for MRAI = 0.625 s and 2.0 s]

Convergence Delay vs. MRAI
• A V-shaped curve, as expected
• Curve flattens out as failure extent increases
• Optimal MRAI shifts to the right with failure extent
  [Figure: convergence delay vs. MRAI for 0.5%, 1%, 2.5%, and 5% failures]

Impact of AS “Distance”
• ASes are more likely to be connected to other “nearby” ASes
  – β indicates the preference for shorter distances (smaller β → higher preference)
• Lower convergence delay for lower β
  [Figure: convergence delay vs. % of failed routers for default (β=∞), β=0.01, β=0.05]

Reducing Convergence Delays
• Many schemes in the literature
  – Most evaluated only for isolated failures
• Some popular schemes
  – Ghost flushing
  – Consistency assertions
  – Root-cause notification
• Our work (focused on large-scale failures)
  – Dynamic MRAI
  – Batching
  – Speculative invalidation

Ghost Flushing
• Bremler-Barr, Afek, Schwarz: Infocom 2003
• An advertisement implicitly replaces the old path
  – GF withdraws the old path immediately
• Pros
  – Withdrawals cascade through the network
  – More likely to install new working routes
• Cons
  – Substantial additional load on routers
  – Flushing takes away a working route!
    • Install [BC] → routes at D, F, I via B will start working; flushing will take them away

Consistency Assertions
• Pei, Zhao, et al.: Infocom 2002
  – If S has two paths S:N1xD & S:N2yN1xD, and the first path is withdrawn, then the second path is not used (considered infeasible)
• Pros
  – Avoids trying out paths that are unlikely to be working
• Cons
  – Consistency checking can be expensive
  [Figure: S with neighbors N1, N2 and paths through x, y to D]

Root Cause Notification
• Pei, Azuma, Massy, Zhang: Computer Networks, 2004
• Modify BGP messages to carry the root cause (e.g., node/link failure)
• Pros
  – Avoids paths with failed nodes/links → substantial reduction in convergence delay
• Cons
  – Changes the BGP protocol; unlikely to be adopted
  – Applicability to large-scale failures unclear
• Example: D, E, G diagnose whether A or the link to A has failed, and propagate this info to neighbors

Large Scale Failures: Our Approach
• What we can’t or wouldn’t do:
  – No coordination between ASes
    • Business issues, security issues, very hard to do, …
  – No change to the wire protocol (i.e., no new message type)
  – No substantial router overhead
    • Critical for large-scale failures
  – Solution applicable to both isolated & large-scale failures
• What we can do:
  – Change MRAI based on network and/or load parameters (e.g., degree-dependent, backlog-dependent, …)
  – Process messages (& generate updates) differently

Key Idea: Dynamic MRAI
• Increase MRAI when the router is heavily loaded
  – Reduces load & number of route changes
• Relationship to large-scale failure
  – Larger failure size → greater router loading → larger MRAI more appropriate
  – Router-load-directed MRAI caters to all failure sizes!
• Implementation:
  – Queue-length-threshold-based MRAI adjustment
  [Figure: queue-length thresholds th1, th2 with increase/decrease actions]

Dynamic MRAI: Effect on Delay
• Change w.r.t. fixed MRAI = 9.375 s
• Improves convergence delay as compared to fixed values
  [Figure: improvement in delay vs. % of failed routers for MRAI = 0.625 s, MRAI = 2.0 s, and dynamic MRAI]

Key Idea: Message Batching
• BGP default: FIFO message processing
  – Unnecessary processing if:
    • A later update (already in queue) changes the route to the destination
    • The timer expires before a later message is processed
• Relationship to large-scale failure
  – Significant batching (and hence batching advantage) is likely for large-scale failures only
• Algorithm
  – A separate logical queue per destination
  – Allows processing of all updates to a destination as a batch
  – >1 update from the same neighbor → delete the older ones
  [Figure: incoming message stream sorted into per-destination logical queues]

Batching: Effect on Delay
• Behavior similar to dynamic MRAI, without actually making MRAI dynamic
• Combination with dynamic MRAI works somewhat better
  [Figure: improvement in delay vs. % of failed routers for MRAI = 0.625 s, MRAI = 2.0 s, and batching (MRAI = 0.25 s)]

Key Idea: Speculative Invalidation
• Large-scale failure
  – A lot of route withdrawals for the failed AS, say X
  – If #withdrawn paths with AS X ∈ AS_path > threshold → invalidate all paths containing X
• Implementation issues
  – Going through the routes for invalidation is inefficient
    • Use route filters at each node
  – Threshold estimation → computed (see paper)
  – Reverting routes to the valid state → time-slot based

Effect of Invalidation
• Avoids exploring unnecessary paths
  – Reduces convergence delay significantly, but …
  – May affect connectivity adversely
• Implement only at nodes with degree 4 or higher
  [Figures: convergence-delay improvement and lost connectivity with speculative invalidation vs. failure %, for all nodes and degree >2, >4, >6, >8]

Comparison of Various Schemes
• CA is the best scheme throughout!
• GF is rather poor
• Batching & dynamic MRAI do pretty well considering their simplicity
  [Figure: improvement in delay vs. % of failed routers for batching, consistency assertion, ghost flushing, and dynamic MRAI]

What’s the Right Performance Metric?
• Convergence delay
  – Network-centric, not user-centric
  – Instability in infrequently used routes is almost irrelevant
• User-centric metrics
  – Packet loss & packet delays
• Convergence delay does not correlate well with user-centric metrics
  [Figure: overall packet-loss probability and convergence delay vs. failure size]

User Centric Metrics
• Computed over all routes & the entire convergence period
  – Single metric: overall average over routes & time
  – Distribution w.r.t. routes, time-dependent rate, etc.
• Fraction of packets lost
• Fractional increase in packet delay
  – Absolute delay depends on route length & is not meaningful
  – Requires end-to-end measurements → much harder than packet loss
  [Figure: packet-loss rate and extra packet delay vs. time]

Comparison between Schemes
• Comparing some major schemes:
  – Consistency assertion (CA)
  – Ghost flushing (GF)
  – Speculative invalidation (SI)
• All 3 schemes reduce convergence delay substantially, but …
• Only CA can really reduce the packet losses!
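The speculative-invalidation rule compared here (invalidate every path through AS X once enough withdrawn paths contain X) can be sketched as below. This shows the deterministic rule; the threshold value and data layout are illustrative, since the tutorial computes the threshold analytically.

```python
from collections import Counter

def speculative_invalidate(routes, withdrawals, thres=3):
    """Speculatively invalidate paths through ASes presumed failed.

    routes: dict dest -> AS path (list of AS names)
    withdrawals: list of recently withdrawn AS paths
    Any AS appearing in more than `thres` withdrawn paths is presumed
    failed; all stored paths through it are dropped. Returns the set
    of presumed-failed ASes.
    """
    seen = Counter()
    for path in withdrawals:
        for asn in set(path):  # count each AS once per withdrawn path
            seen[asn] += 1
    failed = {asn for asn, n in seen.items() if n > thres}
    for dest in list(routes):
        if failed & set(routes[dest]):
            del routes[dest]
    return failed
```

The modified SI scheme discussed later marks routes invalid probabilistically, depending on the fail count, instead of the hard cutoff shown here.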
  [Figures: convergence-delay improvement (T_conv for CA, GF, SI) and packet-loss change vs. failure % for the three schemes]

How Schemes Affect Routes
• Cumulative time for which there is no valid path:
  – T_noroute: time for which there is no route at all
  – T_allinval: time for which all neighbors advertise an invalid route
  – T_BGPinval: time for which BGP chooses an invalid route (even though some neighbor has a valid route)
• GF increases T_noroute the most; CA reduces T_allinval the most
  [Figures: change in T_noroute and T_allinval w.r.t. normal BGP vs. failure %, for GF, CA, SI]

Changes to Reduce Packet Losses
• GF: difficult to reduce T_noroute; not attempted
• CA: use the “best route” even if all of them are “infeasible”, but don’t advertise infeasible routes
  – Improves substantially
• SI: mark the route invalid probabilistically depending on fail count (instead of deterministically)
  – Improves substantially
  [Figure: packet-loss performance of modified CA & SI vs. original CA & SI]

Conclusions & Open Issues
• Inter-domain routing does not perform very well for large-scale failures
  – Considered several schemes for improvement; room for further work
• Convergence delay is not the right metric
  – Defined a packet-loss-related metric & a simple scheme to improve it
• Open Issues for large scale failures – Analytic Modeling of convergence properties. – What aspects affect pkt losses & can we model them? – How do we improve pkt loss performance ? 7/16/2015 K. Kant, Surviving Large Scale Internet Failures, DSN 2007 Tutorial 63 Outline 1. Overview – Internet Infrastructure elements & Large Scale Failures 2. Dealing with Routing Failures 1. Routing algorithms & their properties 2. Improving BGP Convergence 3. Other Performance Metrics 3. Dealing with Name Resolution Failures 1. Name resolution preliminaries 2. DNS vulnerabilities & Solution 4. Conclusions and Open Issues 7/16/2015 K. Kant, Surviving Large Scale Internet Failures, DSN 2007 Tutorial 64 DNS Infrastructure Browser FTP E-Mail root Client Resolver au gov DNS Proxy Cache Local DNS Server 7/16/2015 gb sg nz edu ips sa Authoritative DNS Server K. Kant, Surviving Large Scale Internet Failures, DSN 2007 Tutorial 65 DNS Usage • • • • Name IP mapping Best-matching Time-to-Live (TTL) Iterative vs. recursive lookup • Delegation chains – Avg length 46! – Makes DNS very vulnerable Q: abc.gb.gov.au? Proxy cache root server root hint • .au .gov .ips .xyz • .sg .gb .nz .gov .au .edu .abc DNS Proxy .ips .sa .gb A: 7/16/2015 K. Kant, Surviving Large Scale Internet Failures, DSN 2007 Tutorial 66 DNS Resource Record (RR) Domain name Type Class TTL Variable 2 2 4 RL 2 RDATA Variable • Domain name: – (length, name) pairs, eg., intel.com 05intel03com00 • Record Types – DNS Internal types • Authority: NS, SOA; DNSSEC: DS, DNSKEY, RRSIG, … • Many others: TSIG, TKEY, CNAME, DNAME, … – Terminal RR: • Address records: A, AAAA • Informational: TXT, HINFO, KEY, … (data carried to apps) – Non Terminal RR: • MX, SRV, PTR, … w/ domain names resulting in further queries. • Other fields – RL: Record length, RDATA: IP address, referral, … – TTL: Time To Live in a cache 7/16/2015 K. Kant, Surviving Large Scale Internet Failures, DSN 2007 Tutorial 67 Outline 1. 
DNS Attacks
• Inject incorrect RR into DNS proxy (poisoning)
– Compromise DNS proxy (hard)
– Intercept query & send fake or addl response
• Query interception relatively easy …
– UDP based → don't need any context!
– DNS query uses a 16-bit "client-id" to connect query w/ response
• Fairly static, can be guessed easily
– Response can include additional RRs
• Intercept updates to authoritative server
– Technically not poisoning, but a problem

Poisoning Consequences
• Can be exploited in many ways:
– Disallow name resolution
– Direct all traffic → small set of servers
• DDoS attack!
– Direct to a malicious server to collect info or drop malware
• Scale of attack simply depends on the level in the hierarchy!
– Poison propagates downwards
– Set large TTL to avoid expiry
– Actual scenario in Mar '05 (.com entry poisoned)
[Figure: a record poisoned high in the name hierarchy propagates downwards into proxy caches.]

Making DNS Robust
• TSIG (symmetric key crypto)
– Intended for secure master-slave proxy comm.
– Issues: not general; scalability
• DNSSEC
– Stops cache poisoning, but issues of overhead, infrastructure change, key mgmt, etc.
– Based on PKI; a symmetric key version also exists.
• Cooperative Lookup
– Direct requests to responsive clients (CoDNS)
– Distributed hash table (DHT) structure for DNS (CoDoNS)
– Cooperative checking between clients (DoX)

PK-DNSSEC
• Auth. chain starts from the root
– Parent signs child certificates (avoids "lying" about public key)
– Encrypted exchange also supplies signed public keys
• F: public key; f: private key
[Figure: PK-DNSSEC chain of certificates and private keys down the hierarchy (root, .au, .gov, .gb); the proxy's query to a zone is protected with the zone's public key and the response is signed with its private key, e.g., Fgov(query), fgov(resp, Fgb), Fgb(query), fgb(resp).]

CoDoNS
• Organize DNS using a DHT (distributed hash table).
– Enhances availability via distribution and replication
• Explicit version control to keep all copies current
• Issues
– DHT issues (equal capacity nodes)
– Explicit version control unscalable
– Not directed towards poisoning control (but DNSSEC can be used)

Domain Name Cross-referencing (DoX)
• Client peer groups
– Diversity & common interest based
– Peers agree to cooperate on verification of popular records.
• Mutual verification
– Assumes that the authoritative server is not poisoned.
[Figure: four DoX peers, each with its own view of the name hierarchy, cross-verify records with each other.]

Choosing Peers
• Give & get
– Give: A peer must participate in verification even if it is not interested in the address → overhead
– Get: Immediate poison detection, high data currency
• Selection of peers
– Topic channel w/ subscription by peers
• E.g., names under a Google/Yahoo directory
– Community channel, e.g., peers within the same org
• Minimizing overhead
– Verify only popular (perhaps most vulnerable) names
– May be adequate given the usual Zipf-like popularity dist.
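The last point, that verifying only popular names may suffice under a Zipf-like popularity distribution, can be illustrated with a quick calculation (the exponent s = 1 is an assumption for illustration):

```python
def zipf_top_k_coverage(n: int, k: int, s: float = 1.0) -> float:
    """Fraction of all queries covered by the k most popular of n names
    when popularity follows a Zipf distribution: rank r has weight 1/r**s."""
    weights = [1.0 / (r ** s) for r in range(1, n + 1)]
    return sum(weights[:k]) / sum(weights)
```

With n = 10,000 names and s = 1, verifying just the top 100 names covers roughly half of all queries, which is why restricting DoX verification to popular records keeps the overhead modest.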
DoX Verification
• Verification cache per peer
– Avoids unnecessary re-verification
• Verification
– DNS copy (Rd) = verified copy (Rv) → stop; else send (Ro,Rn) = (Rv,Rd) to all other peers
– At least m peers agree → stop; else obtain authoritative copy Ra, & if Ra != Rd, poison detected
• Agreement procedure
– Involves local copy Rv & remotely received (Ro,Rn)
– If Rv = Rn → agree; else the peer obtains its own authoritative copy Ra
– Several cases, e.g., if Rv = Ro and Ra = Rn → agree
• Verified copy was obsolete, got the correct one now → forced removal of obsolete copy

Handling Multiple IPs per name
• DNS directed load distribution
– Easily handled with set comparison
• Multiple views
– Used to differentiate inside/outside clients
– All peers should belong to the same view (statically or by trial & error)
• Content Distribution Networks (CDNs)
– Same name translates to different IP addresses in different regions
– Need a flowset-based IP address comparison

Results – Normal DNS
[Figure: number of correct/obsolete/poisoned records in the cache and % of correct/obsolete/poisoned replies over time under normal DNS.]
• Poison spreads in the cache
• More queries are affected

Results – DoX
[Figure: number of poisoned records, cache contents, and % replies over time under DoX.]
• Poison removed immediately
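The verification flow above can be condensed into a short sketch; the record representation (plain comparable values) and the peer interface are simplified assumptions, not the DoX wire protocol.

```python
def dox_verify(Rd, Rv, peer_agreements, m, fetch_authoritative):
    """Sketch of DoX verification for one record.
    Rd: copy obtained from regular DNS; Rv: locally verified copy;
    peer_agreements: one bool per peer, True if that peer's verified copy
    matches Rd; m: agreement threshold; fetch_authoritative: callable
    returning the authoritative copy Ra.
    Returns (record to trust, poison_detected)."""
    if Rd == Rv:
        return Rd, False           # matches verified copy: stop
    if sum(peer_agreements) >= m:
        return Rd, False           # at least m peers agree: accept update
    Ra = fetch_authoritative()     # fall back to the authoritative server
    return Ra, Ra != Rd            # mismatch with Ra => poison detected
```

Note that the authoritative fetch is only the last resort, which is what lets DoX detect poison quickly without querying the authoritative server on every lookup.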
DoX vs. DNSSEC

Characteristic | DNSSEC | CoDoNS | DoX
Poison detection | No poisoning | Yes | Yes, unless all peers poisoned
Poison removal | No poisoning | No | Quick, when possible
Effect on obsolete data | No effect | Explicit update propagation | Improves data currency
Overhead | High CPU overhead, increased msg size | Significant replication overhead | Latency & overhead of inter-peer comm.
Protocol impact | Substantial change | Easy to interface w/ regular DNS | None (can implement on top of base DNS)
Vulnerability | Cryptographically secure | Bit more robust (due to distribution & proactive replication) | Coordinated attack can defeat it; updates must be secured
Deployment | Difficult | Fairly easy | Easy
Other challenges | Key distribution | Unequal node capacities | Good choice of peers

Conclusions & Open Issues
• DNS has numerous vulnerabilities & is easy to attack
– Several proposed solutions, none entirely satisfactory
– Large deployed base resists significant overhaul
• Future Work
– Combine the best of DNSSEC, CoDoNS & DoX
– Choice of peers & hardening against malicious peers
– Tackling the delegation "mess"
– Math. analysis w/ delegation, non-Poisson traffic, …
• How do we make DNS robust against large scale coordinated attacks?

That's all folks! Questions?

BGP References
• A.L. Barabasi and R. Albert, “Emergence of Scaling in Random Networks,” Science, pp. 509–512, Oct. 1999.
• A. Bremler-Barr, Y. Afek, and S. Schwarz, “Improved BGP convergence via ghost flushing,” in Proc. IEEE INFOCOM 2003, vol. 2, San Francisco, CA, Mar. 2003, pp. 927–937.
• S. Deshpande and B. Sikdar, “On the Impact of Route Processing and MRAI Timers on BGP Convergence Times,” in Proc. GLOBECOM 2004, vol. 2, pp. 1147–1151.
• T.G. Griffin and B.J. Premore, “An experimental analysis of BGP convergence time,” in Proc. ICNP 2001, Riverside, California, Nov. 2001, pp. 53–61.
• F. Hao, S. Kamat, and P. V. Koppol, “On metrics for evaluating BGP routing convergence,” Bell Laboratories Tech. Rep., 2003.
• C. Labovitz, G. R. Malan, and F. Jahanian, “Internet Routing Instability,” IEEE/ACM Transactions on Networking, vol. 6, no. 5, pp. 515–528, Oct. 1998.
• C. Labovitz, A. Ahuja, et al., “Delayed internet routing convergence,” in Proc. ACM SIGCOMM 2000, Stockholm, Sweden, Aug. 2000, pp. 175–187.
• C. Labovitz, A. Ahuja, et al., “The Impact of Internet Policy and Topology on Delayed Routing Convergence,” in Proc. IEEE INFOCOM 2001, vol. 1, Anchorage, Alaska, Apr. 2001, pp. 537–546.
• A. Lakhina, J.W. Byers, et al., “On the Geographic Location of Internet Resources,” IEEE Journal on Selected Areas in Communications, vol. 21, no. 6, pp. 934–948, Aug. 2003.
• A. Medina, A. Lakhina, et al., “BRITE: Universal topology generation from a user’s perspective,” in Proc. MASCOTS 2001, Cincinnati, Ohio, Aug. 2001, pp. 346–353.
• D. Obradovic, “Real-time Model and Convergence Time of BGP,” in Proc. IEEE INFOCOM 2002, vol. 2, New York, June 2002, pp. 893–901.
• D. Pei, et al., “A study of packet delivery performance during routing convergence,” in Proc. DSN 2003, San Francisco, CA, June 2003, pp. 183–192.
• D. Pei, B. Zhang, et al., “An analysis of convergence delay in path vector routing protocols,” Computer Networks, vol. 30, no. 3, Feb. 2006, pp. 398–421.
• D. Pei, X. Zhao, et al., “Improving BGP convergence through consistency assertions,” in Proc. IEEE INFOCOM 2002, vol. 2, New York, NY, June 23–27, 2002, pp. 902–911.
• Y. Rekhter, T. Li, and S. Hares, “Border Gateway Protocol 4,” RFC 4271, Jan. 2006.
• J. Rexford, J. Wang, et al., “BGP routing stability of popular destinations,” in Proc. Internet Measurement Workshop 2002, Marseille, France, Nov. 6–8, 2002, pp. 197–202.
• A. Sahoo, K. Kant, and P. Mohapatra, “Characterization of BGP recovery under Large-scale Failures,” in Proc. ICC 2006, Istanbul, Turkey, June 11–15, 2006.
• A. Sahoo, K. Kant, and P. Mohapatra, “Improving BGP Convergence Delay for Large Scale Failures,” in Proc. DSN 2006, June 25–28, 2006, Philadelphia, Pennsylvania, pp. 323–332.
• A. Sahoo, K. Kant, and P. Mohapatra, “Speculative Route Invalidation to Improve BGP Convergence Delay under Large-Scale Failures,” in Proc. ICCCN 2006, Arlington, VA, Oct. 2006.
• A. Sahoo, K. Kant, and P. Mohapatra, “Improving Packet Delivery Performance of BGP During Large-Scale Failures,” submitted to GLOBECOM 2007.
• “SSFNet: Scalable Simulation Framework”. [Online]. Available: http://www.ssfnet.org/
• W. Sun, Z. M. Mao, and K. G. Shin, “Differentiated BGP Update Processing for Improved Routing Convergence,” in Proc. ICNP 2006, Santa Barbara, CA, Nov. 12–15, 2006, pp. 280–289.
• H. Tangmunarunkit, J. Doyle, et al., “Does Size Determine Degree in AS Topology?,” ACM SIGCOMM, vol. 31, issue 5, pp. 7–10, Oct. 2001.
• R. Teixeira, S. Agarwal, and J. Rexford, “BGP routing changes: Merging views from two ISPs,” ACM SIGCOMM, vol. 35, issue 5, pp. 79–82, Oct. 2005.
• B. Waxman, “Routing of Multipoint Connections,” IEEE Journal on Selected Areas in Communications, vol. 6, no. 9, pp. 1617–1622, Dec. 1988.
• B. Zhang, R. Liu, et al., “Measuring the internet’s vital statistics: Collecting the internet AS-level topology,” ACM SIGCOMM, vol. 35, issue 1, pp. 53–61, Jan. 2005.
• B. Zhang, D. Massey, and L. Zhang, “Destination Reachability and BGP Convergence Time,” in Proc. GLOBECOM 2004, vol. 3, Dallas, TX, Nov. 2004, pp. 1383–1389.

DNS References
• R. Arends, R. Austein, et al., “DNS Security Introduction & Requirements,” RFC 4033, 2005.
• G. Ateniese and S. Mangard, “A new approach to DNS security (DNSSEC),” in Proc. 8th ACM Conf. on Computer & Communications Security, 2001.
• D. Atkins and R. Austein, “Threat analysis of the domain name system,” RFC 3833, http://www.rfc-archive.org/getrfc.php?rfc=3833, August 2004.
• R. Curtmola, A. D. Sorbo, and G. Ateniese, “On the performance and analysis of DNS security extensions,” in Proc. CANS, 2005.
• M. Theimer and M. B. Jones, “Overlook: Scalable name service on an overlay network,” in Proc. 22nd ICDCS, 2002.
• K. Park, V. Pai, et al., “CoDNS: Improving DNS performance and reliability via cooperative lookups,” in Proc. 6th Symp. on OS Design & Implementation, 2004.
• L. Yuan, K. Kant, et al., “DoX: A peer-to-peer antidote for DNS cache poisoning attacks,” in Proc. IEEE ICC, 2006.
• L. Yuan, K. Kant, and P. Mohapatra, “A proxy view of quality of domain name service,” in Proc. IEEE INFOCOM 2007.
• V. Ramasubramanian and E.G. Sirer, “The design and implementation of a next generation name service for the internet,” in Proc. ACM SIGCOMM 2004.

Backup

Quality of DNS Service (QoDNS)
• Availability
– Measures if DNS can answer the query
– Prob. of correct referral when the record is not cached
• Accuracy
– Prob. of hitting a stale record in the proxy cache
• Poison Propagation
– Prob(poison at leaf level at time t | level k poisoned at t=0)
• Latency
– Additional time per query
• Overhead
– Additional msgs/BW per query

Computation of Metrics
[Figure: timeline of a cached record: query misses reload the record, hits are served from the cache until TTL expiration; a modification at the authoritative server leaves the cached copy obsolete until the next miss.]
• Modification at authoritative server
– Copy is obsolete, but the proxy is not aware until TTL expires & a new query forces a reload
• XR: Residual time of query arrival process
• MR: Residual time of modification process
• Y: Inter-miss period = TTL + XR
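The relation Y = TTL + XR can be sanity-checked with a small Monte Carlo sketch; Poisson query arrivals (so XR is exponential by memorylessness) are the model's assumption, not a claim about real traffic.

```python
import random

def mean_inter_miss(ttl: float, query_rate: float, n: int = 50_000) -> float:
    """Estimate E[Y] for a TTL cache fed by Poisson queries: after a miss
    the record is cached for TTL, then the residual time XR to the next
    query is Exp(query_rate), so E[Y] = TTL + 1/query_rate."""
    return sum(ttl + random.expovariate(query_rate) for _ in range(n)) / n
```

For TTL = 10 and a query rate of 2, the estimate converges to 10.5, matching TTL + 1/rate.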
Dealing with Hierarchy
• A miss at a node → a query at its parent
• Superposition of miss processes of children → query arrival process of parent
• Recursively derive the arrival process bottom-up

Deriving QoDNS Metrics
• Accuracy
– Prob. the leaf record is "current"
• (Un)Availability
– Prob. the BMR is an "obsolete referral"
• Latency
– RTT × #referrals
• Overhead
– Related to #referrals for current RRs & #tries for obsolete RRs
[Figure: name hierarchy illustrating the best-matching record (BMR) cases: BMR is current, BMR is an obsolete referral, BMR is an obsolete record.]

Model Validation
• Poisson and Gamma arrival models
• Uniform/Zipf popularity
[Figure: accuracy, overhead, latency, and unavailability vs. query arrival rate; higher rate → more caching.]

Survey of TTL Values
• 2.7 million names on dmoz.org
• 1 hr, 1 day, 2 days dominate
• Some extremely small values
• How to pick the TTL for a domain?
[Figure: CDF of TTLs.]

Impact of TTL
[Figure: overhead and % failures vs. TTL for no, moderate, and frequent modification.]
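The bottom-up recursion described under "Dealing with Hierarchy" follows directly from the inter-miss relation Y = TTL + XR: a child's miss rate is 1/E[Y], and the parent sees the superposition of its children's misses. A sketch (treating the superposed process as Poisson at each level is the usual approximation):

```python
def miss_rate(arrival_rate: float, ttl: float) -> float:
    """Miss rate of a TTL cache with Poisson query arrivals: the mean
    inter-miss period is TTL + 1/arrival_rate, so the miss rate is its
    reciprocal."""
    return 1.0 / (ttl + 1.0 / arrival_rate)

def parent_arrival_rate(children):
    """Superpose the children's miss processes to obtain the parent's
    query arrival rate; children is a list of (arrival_rate, ttl) pairs."""
    return sum(miss_rate(rate, ttl) for rate, ttl in children)
```

Applying `parent_arrival_rate` level by level from the leaves up yields the query arrival process at each node of the hierarchy, as the slide prescribes.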