Network diagnostics made easy

Pathdiag: Automatic TCP Diagnosis

Matt Mathis (PSC), John Heffner (PSC/Rinera), Peter O'Neil (NCAR/Mid-Atlantic Crossroads), Pete Siempsen (NCAR)
30 April 2008
http://staff.psc.edu/mathis/papers/PAM20080430.ppt

Outline

• What is the problem?
• The pathdiag solution
• Details
• The bigger problem

What is the problem?

Internet2 weekly traffic statistics – about 3 Mb/s!

Why is end-to-end performance difficult?

• By design, TCP/IP hides the 'net from upper layers
  – TCP/IP provides basic reliable data delivery
  – The "hour glass" between applications and networks
• This is a good thing, because it allows:
  – Invisible recovery from data loss, etc.
  – Old applications to use new networks
  – New applications to use old networks
• But then (nearly) all problems have the same symptom
  – Less than expected performance
  – The details are hidden from nearly everyone

TCP tuning is painful debugging

• All problems reduce performance
  – But the specific symptoms are hidden
• Any one problem can prevent good performance
  – Completely masking all other problems
• Trying to fix the weakest link of an invisible chain
  – The general tendency is to guess and "fix" random parts
  – Repairs are sometimes "random walks"
  – At best, one problem is repaired at a time
• The solution is to instrument TCP

The Web100 project

• Use TCP's ideal diagnostic vantage point
  – Instrument TCP: what is limiting the data rate?
  – RFC 4898 TCP-ESTATS-MIB
    • Standards track
    • Prototypes for Linux (www.Web100.org) and Windows Vista
  – Fix TCP's part of the problem: autotuning
    • Automatically adjusts TCP socket buffers
    • Linux 2.6.17 default maximum window size is 4 MBytes
    • Microsoft Vista default maximum window size is 8 MBytes
      – (Except IE)
  – Web100 is done
    • But still under limited support

New insight: symptoms scale with RTT

• Example flaws:
  – TCP buffer space: Rate = Buffer / RTT
  – Packet loss: Rate = (MSS / RTT) × (1 / √Loss)
• Think: RTT in the denominator converts "rounds" to elapsed time.
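A small numeric sketch of why symptoms scale with RTT, using the two formulas above. The buffer size, RTTs, and segment size here are illustrative values, not figures from the talk:

```python
from math import sqrt

def buffer_limited_rate(buffer_bytes, rtt_s):
    # Rate = Buffer / RTT
    return buffer_bytes / rtt_s

def loss_limited_rate(mss_bytes, rtt_s, loss):
    # Rate = (MSS / RTT) * (1 / sqrt(Loss))
    return (mss_bytes / rtt_s) / sqrt(loss)

# A 64 kByte socket buffer is ample on a 1 ms LAN path ...
lan_rate = buffer_limited_rate(64 * 1024, 0.001)   # ~65 MByte/s
# ... but the very same buffer caps a 100 ms path at ~0.65 MByte/s.
wan_rate = buffer_limited_rate(64 * 1024, 0.100)
```

The same flaw (an undersized buffer) is invisible on the short path and crippling on the long one, which is exactly the symptom-scaling problem described above.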

Symptom scaling breaks diagnostics

• Local client to server:
  – Flaw has insignificant symptoms
  – All applications work, including all standard diagnostics
  – False pass for all diagnostic tests
• Remote client to server: all applications fail
  – Leading to faulty implication of other components
  – It seems that the flaws are in the wide-area network

The confounded problems

• For nearly all network flaws:
  – The only symptom is reduced performance
  – But the reduction is scaled by RTT
• Therefore, flaws are undetectable on short paths
  – False pass for even the best conventional diagnostics
  – Leads to faulty inductive reasoning about flaw locations
  – Diagnosis often relies on tomography and complicated inference techniques

This is the real end-to-end performance problem

Goals

• We want to automate debugging for "the masses"
  – But start with the low-hanging fruit
• Who are the users? Assume:
  – Analytic (e.g. non-network scientists)
    • Not afraid of math or measurements
  – Known data sources
    • Primary data direction is towards the users
  – They have systems and network support
    • Only need to do first-level diagnosis

More Goals

• Automatic: "one click" in a web browser
• Diagnose first-level problems
  – Easily expose all path bottlenecks that limit performance to less than 10 MByte/s
  – Easily expose all end-system/OS problems that limit performance to less than 10 MByte/s
    • Will become moot as autotuning is deployed
• Empower the users to apply the proper motivation
  – Results need to be accurate, well explained, and common to both users and sys/net admins

The pathdiag solution

• Test a short section of the path
  – Most often the first or last mile
• Use Web100 to collect detailed TCP statistics
  – Loss, delay, queuing properties, etc.
• Use models to extrapolate results to the full path
  – Assume that the rest of the path is ideal
  – The user has to specify the end-to-end goal
    • Data rate and RTT
• Pass/fail on the basis of the extrapolated performance
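The extrapolation step can be sketched as follows. This is an assumed simplification of the loss-rate check only (the actual pathdiag tool tests several properties); the function name and parameters are hypothetical:

```python
from math import sqrt

def extrapolated_pass(measured_loss, mss_bytes, target_rtt_s, target_rate):
    """Extrapolate the loss rate measured on the short section to the
    user's full-path target, assuming the rest of the path is ideal,
    and pass/fail against the required data rate (bytes/s)."""
    if measured_loss == 0:
        return True  # no loss observed: this section is not the bottleneck
    # Rate = (MSS / RTT) * (1 / sqrt(Loss)), evaluated at the FULL-path RTT
    predicted = (mss_bytes / target_rtt_s) / sqrt(measured_loss)
    return predicted >= target_rate

# Target: 10 MByte/s at 50 ms RTT, with 1460-byte segments.
extrapolated_pass(1e-6, 1460, 0.050, 10e6)   # very low loss: passes
extrapolated_pass(1e-4, 1460, 0.050, 10e6)   # modest loss: fails at 50 ms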

Deploy as a Diagnostic Server

• Use pathdiag in a Diagnostic Server (DS)
• Specify end-to-end target performance
  – From server (S) to client (C): RTT and data rate
• Measure the performance from DS to C
  – Use Web100 in the DS to collect detailed statistics
    • On both the path and the client
  – Extrapolate performance assuming an ideal backbone
• Pass/fail on the basis of extrapolated performance

Demo

• Click here for a live server

Pathdiag output

Pathdiag output

Key NPAD/pathdiag features

• Results are intended to be self-explanatory
  – Provides a list of specific items to be corrected
    • Failed tests are show stoppers for fast applications
  – Includes explanations and tutorial information
  – Clear differentiation between client and path problems
  – Accurate escalation to network or system admins
  – The reports are public and can be viewed by either
• Coverage for a majority of OS and last-mile network flaws
  – Coverage is one way – need to reverse client and server
  – Does not test the application – need application tools
  – Does not check routing – need traceroute
  – Does not check for middleboxes (NATs, etc.)
• Eliminates nearly all(?) false pass results

More features

• Tests become more sensitive as the path gets shorter
  – Conventional diagnostics become less sensitive
  – Depending on models, perhaps too sensitive
    • New problem is false fail (e.g. queue-space tests)
• Flaws no longer completely mask other flaws
  – A single test often detects several flaws
    • E.g. can find both OS and network flaws in the same run
  – They can be repaired concurrently
• Archived DS results include raw Web100 data [Sample]
  – Can reprocess with updated reporting SW
    • New reports from old data
  – Critical feedback for the NPAD project
    • We really want to collect "interesting" failures

Under the covers

• Same base algorithm as "Windowed Ping" [Mathis, INET'94]
  – Aka "mping" – see http://www.psc.edu/~mathis/wping/
  – Killer diagnostic in use at PSC in the early 90s
  – Stopped being useful with the advent of "fast path" routers
• Use a simple fixed-window protocol
  – Scan window size in 1-second steps
    • Pathdiag clamps cwnd to control the TCP window
    • Varies step size – fine steps near interesting features
  – Measure data rate, loss rate, RTT, etc. as the window changes
  – Reports reflect key features of the measured data
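The scan loop described above can be sketched as follows. This is an assumed structure, not the actual pathdiag source; `toy_measure` is a fabricated stand-in for a real clamped-window measurement:

```python
def scan_windows(measure, windows):
    """Step the clamped window through each size, recording rate,
    loss, and RTT, and deriving power = rate / RTT (the quantity
    plotted on the 'Window Size vs Power' slide)."""
    results = []
    for w in windows:
        rate, loss, rtt = measure(w)   # one 1-second step at window w
        results.append({"window": w, "rate": rate, "loss": loss,
                        "rtt": rtt, "power": rate / rtt})
    return results

# Toy stand-in: rate grows with window until a 10-packet bottleneck
# queue fills, after which further packets are dropped.
def toy_measure(w):
    rate = min(w, 10) * 1500 / 0.01           # bytes/s at 10 ms RTT
    loss = 0.0 if w <= 10 else (w - 10) / w   # drops beyond the queue
    return rate, loss, 0.01

curve = scan_windows(toy_measure, range(1, 21))
```

The "interesting features" in the report correspond to knees in this curve, e.g. the window size where loss first appears.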

Window Size vs Data Rate

Window Size vs Loss Rate

Window Size vs RTT

Window Size vs Power (Power = Rate / RTT)

The Bigger Picture

• Download and install
  – http://www.psc.edu/networking/projects/pathdiag/
  – The hardest part is building a Linux kernel
  – Beyond end-of-funding, still under limited support
• Barriers to adoption
  – User expectations
  – Our language
  – Network administrators

Need to recalibrate user expectations

• Long history of very poor network performance
  – Users do not know what to expect
  – Users have become completely numb
  – Users have no clue about how poorly they are doing
  – Because TCP/IP hides the network all too well
• We need to re-educate R&E users:
  – Less than 1/2 gigabyte per minute is not high speed
  – Everyone should be able to reach this rate
  – People who can't should know why or be angry

Language problems

• Nobody except network geeks uses bits/second
• BTW, on the last slide:
  – 1/2 gigabyte/minute is about
    • 10 MByte/s, or
    • 80 Mb/s
    • 17-year-old LAN technology (FDDI)
  – Nothing slower should be considered "high speed"

Campus network administrators

• Generally very underfunded, and know it
• Can't support all users equally
• Don't want users to compare results
• Don't want to enable accurate user complaints
• Don't want pathdiag
• Workaround: deploy "upstream"

Closing

• Satisfied our immediate technical goals • The bigger problem still requires a lot more work

Backup slides

What about impact of the test traffic?

• Pathdiag server is single-threaded
  – Only one test at a time
• Same load as any well-tuned TCP application
  – Protected by TCP "fairness"
    • Large flows are generally "softer" than small flows
    • Large flows are easily disturbed by small flows
    • Note that any short-RTT flow is stiffer than a long-RTT flow

NPAD/pathdiag deployment

• Why should a campus networking organization care?
  – "Zero effort" solution to mis-tuned end-systems
  – Accurate reports of real problems
    • You have the same view as the user
    • Saves time when there really is a problem
    • You can document reality for management
• Suggestion:
  – Require pathdiag reports for all performance problems

Download and install

• User documentation: http://www.psc.edu/networking/projects/pathdiag/
• Follow the link to "Installing a Server"
  – Easily customized with a site-specific skin
  – Designed to be easily upgraded with new releases
    • Roughly every 2 months
  – Improving reports through ongoing field experience
• Drops into existing NDT servers
  – Plans for future integration
• Enjoy!

The Wizard Gap

The Wizard Gap Updated

• Experts have topped out end systems & links
  – 10 Gb/s NIC bottleneck
  – 40 Gb/s "link" bandwidth (striped)
• Median I2 bulk rate is 3 Mbit/s
  – See http://netflow.internet2.edu/weekly/
• Current gap is about 3000:1
• Closing the first factor of 30 should now be "easy"

Pathdiag

• Initial version aimed at "NSF domain scientists"
  – People with a non-networking analytical background
• Report designed to:
  – Accurately identify the failing subsystem
  – Provide tutorial information
  – Provide good escalation to network or host admins
  – Support the user as the ultimate judge of success
• Future plan to split reports:
  – Even easier for non-experts
  – Better information for experts

Pathdiag

• One-click automatic performance diagnosis
  – Designed for (non-expert) end users
    • Future version will better support both expert and non-expert
  – Accurate end-system and last-mile diagnosis
    • Eliminates most false pass results
    • Accurate distinction between host and path flaws
    • Accurate and specific identification of most flaws
  – Basic networking tutorial info
    • Helps the end user understand the problem
    • Helps train 1st-tier support (sysadmin or netadmin)
    • Backup documentation for support escalation
  – Empower the user to get it fixed
    • The same reports for users and admins