Evaluating NGI performance
Matt Mathis
[email protected]

Evaluating NGI Performance
• How well is the NGI being used?
• Where can we do better?

Outline
• Why is this such a hard problem?
  – Architectural reasons
  – Scale
• A systematic approach

TCP/IP Layering
• The good news:
  – TCP/IP hides the details of the network from users and applications
  – This is largely responsible for the explosive growth of the Internet

TCP/IP Layering
• The bad news:
  – All bugs and inefficiencies are hidden from users, applications, and network administrators
  – The only symptoms that can surface for any problem, anywhere, are connection failures or less-than-expected performance

Six performance problems
• IP path
  – Packet routing, round-trip time
  – Packet reordering
  – Packet losses, congestion, lame hardware
• Host or end system
  – MSS negotiation, MTU discovery
  – TCP sender or receiver buffer space
  – Inefficient applications

Layering obscures problems
• Consider: trying to fix the weakest link of an invisible chain
• Typical users, system administrators, and network administrators routinely fail to "tune" their own systems
• In the future, WEB100 will help…

NGI Measurement Challenges
• The NGI is so large and complex that you cannot observe all of it directly
• We want to assess both network and end-system problems
  – The problems mask each other
  – Users and administrators cannot even diagnose their own problems

The Strategy
• Decouple paths from end systems
  – Test some paths using well-understood end systems
  – Collect packet traces and algorithmically characterize performance problems

Performance is the minimum of:
• TCP bulk transport (path limitation): Rate ≤ (MSS / RTT) × (C / √p), with C ≈ 0.7
• Sender or receiver TCP buffer space: Rate ≤ Size / RTT
• Application, CPU, or other I/O limit
(A worked sketch of these limits appears after "The Experiments" slide below.)

Packet trace instrumentation
• Independent measures of the model:
  – Data rate, MSS, RTT, and p
  – Measure independent distributions for each
• Detect end-system limitations
  – Whenever the model Rate ≈ (MSS / RTT) × (C / √p) does not fit

The Experiments
• Actively test a (small) collection of paths with carefully tuned systems
• Passively trace and diagnose all traffic at a small number of points, to observe large collections of paths and end systems
• [Wanted] Passively observe flow statistics for many NGI paths, to take a complete census of all end systems capable of high data rates
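To make the model above concrete, here is a minimal sketch of how the two rate limits combine. The formulas and the constant C ≈ 0.7 come from the slides; the function names and the example MSS, RTT, loss rate, and buffer size are illustrative assumptions, not values from the talk.

```python
# Sketch: combine the two TCP rate limits from the model above.
# C ~= 0.7 is the constant quoted on the slide; all numeric inputs
# below are purely illustrative.
from math import sqrt

C = 0.7  # model constant from the slide

def path_limited_rate(mss_bytes, rtt_s, loss_prob):
    """Bulk-transport limit: Rate <= (MSS / RTT) * (C / sqrt(p)), in bits/s."""
    return (mss_bytes * 8 / rtt_s) * (C / sqrt(loss_prob))

def buffer_limited_rate(window_bytes, rtt_s):
    """Sender/receiver buffer limit: Rate <= Size / RTT, in bits/s."""
    return window_bytes * 8 / rtt_s

# Illustrative path: 1460-byte MSS, 70 ms RTT, one loss per 10^5 packets,
# 64 kByte socket buffer.
mss, rtt, p = 1460, 0.070, 1e-5
window = 64 * 1024

rate = min(path_limited_rate(mss, rtt, p), buffer_limited_rate(window, rtt))
print(f"path limit:    {path_limited_rate(mss, rtt, p) / 1e6:.1f} Mbit/s")
print(f"buffer limit:  {buffer_limited_rate(window, rtt) / 1e6:.1f} Mbit/s")
print(f"expected rate: {rate / 1e6:.1f} Mbit/s")
```

With these illustrative numbers the 64 kByte buffer, not the path, is the binding limit, which is exactly the kind of end-system bottleneck the trace analysis below is designed to detect.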
Active Path Testing
• Use uniform test systems
  – Mostly Hans Werner Braun's AMP systems
  – Well-tuned systems and application
  – Known TCP properties
• Star topology from PSC for initial tests
  – Evolve to multi-star and sparse mesh
• Use passive instrumentation

Typical (Active) Data
• 83 paths measured
• For the moment, assume:
  – All host problems have been eliminated
  – All bottlenecks are due to the path
• Use traces to measure path properties
  – Rate, MSS, and RTT
  – Estimate window sizes and loss interval
• The sample has a target selection bias

Data Rate
[Figure: CDF of data rate, 0 to 80 Mbit/s]

Data Rate Observations
• Only one path performed well (74 Mbit/s)
• About 15% of the paths beat 100 MB in 30 s (27 Mbit/s)
• About half of the paths were below old Ethernet rates (10 Mbit/s)

Round Trip Times
[Figure: CDF of round-trip times, 0 to 200 ms]

RTT Observations
• About 25% of the RTTs are too high (PSC to San Diego is ~70 ms)
  – Many reflect routing problems
  – At least a few are queuing (traffic) related

Loss Interval (1/p)
[Figure: CDF of loss interval, 10^1 to 10^6 packets between losses]

Loss Interval Observations
• Only a few paths do very well
  – Some low-loss paths have high delay
• Only paths with fewer than 10 losses per million packets are OK
• Finding packet losses at this level can be difficult

Passive trace diagnosis
• Trace Analysis and Automatic Diagnosis (TAAD)
• Passively observe user traffic to measure the network
• These are very early results

Example Passive Data
• Traffic is through the Pittsburgh GigaPoP
• Collected with MCI/NLANR/CAIDA OC3mon and CoralReef software
• This data set is mostly commodity traffic
• Future data sets will be self-weighted NGI samples

Observed and Predicted Window
• The window can be observed by looking at TCP retransmissions
• The window can be predicted from the observed interval between losses
• If they agree, the flow is path limited
  – The bulk performance model fits the data
• If they don't, the flow is end-system limited
  – The observed window is probably due to buffer limits, but may be due to other bottlenecks
(A sketch of this comparison appears at the end of the transcript.)

Window Sizes
[Figure: histogram of observed vs. predicted window sizes, population vs. window in kBytes]

Window Sizes
[Figure: CDF of observed vs. predicted window sizes, in kBytes]

Observations
• 60% of the commodity flows are path limited, with window sizes smaller than 5 kBytes
• The huge discontinuity at 8 kBytes reflects common default buffer limits
• About 15% of the flows are affected by this limit

Need NGI host census
• Populations of end systems which have reached significant performance plateaus
• Have solved "all" performance problems
• Confirm other distributions
• Best collected within the network itself

Conclusion
• TCP/IP layering confounds diagnosis
  – Especially with multiple problems
• Many pervasive network and host problems
  – Multiple problems seem to be the norm
• Better diagnosis requires better visibility
  – Ergo WEB100
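The path-limited versus end-system-limited test from the "Observed and Predicted Window" slide can be sketched as follows. This is a minimal illustration, not the TAAD implementation: it approximates the observed window as rate × RTT rather than deriving it from retransmissions in a real packet trace, and the 25% agreement threshold, helper names, and example numbers are assumptions made for the sketch.

```python
# Sketch of the observed-vs-predicted window check. Assumptions (not from
# the slides): observed window ~= rate * RTT, and a 25% tolerance decides
# whether the bulk performance model "fits".
from math import sqrt

C = 0.7  # model constant, as on the earlier slide

def predicted_window_bytes(mss_bytes, loss_interval_pkts):
    """Window predicted by the model: C * MSS / sqrt(p), with p = 1 / loss interval."""
    return C * mss_bytes * sqrt(loss_interval_pkts)

def observed_window_bytes(rate_bps, rtt_s):
    """Rough observed window: data in flight ~= rate * RTT."""
    return rate_bps / 8 * rtt_s

def classify_flow(rate_bps, rtt_s, mss_bytes, loss_interval_pkts, tol=0.25):
    obs = observed_window_bytes(rate_bps, rtt_s)
    pred = predicted_window_bytes(mss_bytes, loss_interval_pkts)
    if abs(obs - pred) <= tol * pred:
        return "path limited"        # the bulk performance model fits
    return "end-system limited"      # buffer space or another bottleneck

# Illustrative flow: roughly an 8 kByte default socket buffer over a 70 ms path
print(classify_flow(rate_bps=0.9e6, rtt_s=0.070,
                    mss_bytes=1460, loss_interval_pkts=1e4))
```

For these illustrative inputs the observed window (about 8 kBytes) falls far short of the window the loss interval would support, so the flow is classified as end-system limited, matching the 8 kByte buffer discontinuity noted in the passive data.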