IEPM/PingER Project
Les Cottrell, SLAC
DoE 2004 PI Network Research Meeting, FNAL, Sep 15-17 ‘04
www.slac.stanford.edu/grp/scs/net/talk03/scidac-pinger-sep04.ppt

Outline
• PingER
  – Purpose etc.
  – Methodology
  – Results
• PingER-NG ≡ IEPM-BW
  – Low network impact bandwidth tool (INCITE)
  – Traceroute viz
  – Topology (INCITE)

PingER
• Uses ping to provide lightweight performance monitoring:
  – < 100 bits/s per pair measured
  – No software to install at remote sites
  – Measures loss, RTT, reachability, jitter
• For planning, troubleshooting
• Originally (1990s) for HENP sites
• More recently also to characterize the Digital Divide
  – ICFA/SCIC, Internet2 Hard to Reach Places, WSIS, ICTP/eJDS

Methodology
• Use ubiquitous ping
• Every 30 minutes from monitoring site to target:
  – 1 ping to prime caches
  – by default send 11 x 100 Byte pkts followed by 10 x 1000 Byte pkts
• Low network impact + no software to install / configure / maintain at remote sites + no passwords / accounts needed = good for developing sites / regions
• Record loss & RTT (+ reorders, duplicates)
• Derive throughput, jitter, unreachability …

Architecture
[Diagram labels: Archive (SLAC, FNAL); Monitoring (~35); Remote (~550); Ping; HTTP; WWW; Reports & Data; Cache; 1 monitor host + 1 remote host = a pair]
• Hierarchical vs. full mesh

Coverage
• In last 9 months added:
  – Several sites in Russia (thanks GLORIAD)
  – Many hosts in Africa (5 => 36, now in 27 out of 54 countries)
  – Monitoring sites in Pakistan and Brazil (Sao Paulo and Rio)
• Now monitoring 650 sites in 115 countries
• Working to install monitoring host in Bangalore, India
[Map legend: Monitoring site, Remote site]

World View
• C. Asia, Russia, S.E. Europe, L. America, M. East, China: 4-5 yrs behind
  – S.E. Europe, Russia: catching up
  – Latin Am., Mid East, China: keeping up
• India, Africa: 7 yrs behind
  – India, Africa: falling behind
• 50% improvement/year ~ factor of 10 in < 6 years
• Important for policy makers
[Figure: TCP throughput measured from N. America to World Regions. Y axis: derived TCP throughput in KBytes/sec (log scale, 1 to 10000). X axis: Jan-95 to Dec-04. Series (number of sites): Edu (141), Europe (150), Canada (27), Latin America (37), Mid East (16), S.E. Europe (21), Caucasus (8), C. Asia (8), Africa (30), India (7), Russia (17), China (13). From the PingER project, Aug 2004.]

View from CERN
• Confirms view from N. America
[Figure: TCP throughput from CERN to World Regions. Y axis: derived TCP throughput in Kbits/s (log scale, 1 to 100000). X axis: Feb-98 to Dec-04. Series: Europe, SE Europe, N America, M East, India, L America, Russia, China, Africa. From the PingER project, August 2004.]

From Developing Regions
Novosibirsk
[Figure: TCP throughput from Novosibirsk to world regions. Y axis: derived throughput in Kbits/s (log scale, 1 to 10000). X axis: Sep-02 to Aug-04. Series: Africa, Balkans, Europe, N. America, S. America, Australasia, E. Asia, M. East, Russia, S. Asia. Annotations: Moscow, Japan/China.]
• NSK to Moscow used to be OK but loss went up in Sep. 2003
  – Big loss increase to Moscow (from < 1% to 2-3%)
• GLORIAD may help
Brazil (Sao Paulo)
[Figure: TCP throughput measured from Brazil to World Regions. Y axis: derived TCP throughput in KBytes/s (log scale, 10 to 10000). X axis: Jan-04 to Aug-04. Series: Latin America, S. America, N. America, Europe, Africa, Russia, E. Asia, S. Asia.]
• As expected Brazil to L. America is good
  – Actually dominated by Brazil to Brazil
• To Chile & Uruguay poor since goes via US

Technology Achievement Index (TAI)
• TAI captures how well a country is creating and diffusing technology and building a human skills base.
• TAI from UNDP hdr.undp.org/reports/global/2001/en/pdf/techindex.pdf

TAI top 12
Country          TAI
Finland          0.744
US               0.733
Sweden           0.703
Japan            0.698
Korea, Rep. of   0.666
Netherlands      0.630
UK               0.606
Canada           0.589
Australia        0.587
Singapore        0.585
Germany          0.583
Norway           0.579

[Plot annotation: US & Canada off-scale]

PingER-NG = IEPM-BW
• Need measurement tools for high-performance paths/applications
  – BER 10^-8 takes > day to see 1 loss
  – Ping losses ≠ TCP losses
• Build infrastructure to
  – Measure with:
    • Iperf (TCP mem-to-mem), GridFTP, bbftp
    • Lightweight packet pair dispersion
  – Evaluate measurement tools

Low impact bandwidth measurement
• Goals:
  – Make a measurement in < second rather than tens of seconds
  – Inject little network traffic
  – Provide reasonable agreement with more intense methods (e.g. iperf)
• Enables:
  – Measurements of low performance links (e.g. to developing countries)
  – Helps avoid need for scheduling
  – More frequent measurements (minutes vs. hours)
  – Lower impact, more friendly

Low impact Bandwidth
• Use 20 packet pairs to roughly estimate dynamic bw Capacity & Xtraffic, then Available = Capacity - Xtraffic
  – Capacity from min pair separation; Xtraffic from packet pair dispersion
[Figure: ABwE SLAC to Caltech, Mar 19, 2004. Curves: dynamic bandwidth capacity (DBC), cross-traffic (X-traffic), available bandwidth = DBC - X-traffic, Iperf.]

Achievable throughput & file transfer
• IEPM-BW
  – High impact (iperf, bbftp, GridFTP …) measurements at 90 ± 15 min intervals
[Figure annotations: Fwd route change, Rev route change, Iperf, abing, bbftp, iperf1, Min RTT, Avg RTT, Select focal area]

Anomalous Event Detection
• Too many graphs to scan by hand, need to automate
  – SLAC-Caltech link performance dropped by factor 5 for ~ month before noticed, fixed within 4 hours of reporting
• Looking for long-term step down changes in bandwidth
• Use modified “plateau” algorithm from NLANR
  – Divide data into history & trigger buffer
  – If y < mh - b * σh then trigger, else history (b = 2), where mh and σh are the mean and standard deviation of the history buffer
• When trigger buffer fills: if mt < d * mh, then have an event (mt is the mean of the trigger buffer)

Route table Example
• Compact so can see many routes at once
[Screenshot annotations:]
  – History navigation
  – Route # at start of day, gives idea of route stability
  – Multiple route changes (due to GEANT), later restored to original route
  – Mouseover for hops & RTT
  – Available bandwidth
  – Raw traceroute logs for debugging
  – Textual summary of traceroutes for email to ISP
  – Description of route numbers with date last seen
  – User readable (web table) routes for this host for this day

Another example
[Screenshot annotations:]
  – Get AS information for routes
  – Level change
  – Host not pingable
  – TCP probe type
  – ICMP checksum error
  – Intermediate router does not respond

Topology
• Choose times and hosts and submit request
[Screenshot annotations:]
  – Hour of day
  – Alternate route
  – Nodes colored by ISP
  – Mouseover shows node names
  – Click on node to see subroutes
  – Click on end node to see its path back
  – Also can get raw traceroutes with AS’
[Network labels: SLAC, ESnet, GEANT, JAnet, IN2P3, CESnet, CCLRCDL]

Putting it together
[Diagram labels: SLAC, ESnet, CENIC, Abilene, SOX, Supernet, with points marked P]
[Figure: Bandwidth from SLAC to Supernet.org, June 2, 2004. Y axis: bandwidth in Mbits/s (0 to 1000). X axis: 6/2/04 0:00 to 6/3/04 0:00. Curves: Cap, Abw, Xtr, mh, mh - 2 σh. Route changes are marked.]
• mh = 954 Mbits/s, mt = 753 Mbits/s
• (mh - mt) / sqrt((σh**2 + σt**2) / 2) = 2.4
• b (sensitivity) = 2; d; l (history buffer length) = 600; t (trigger buffer length) = 60

New features in works (with NIIT)
• Improve new site set-up tools
• Improve management
  – Discover non working links faster
• Improve access to data and meta data
  – Provide data base with lat/long, country etc.
  – Add web services access
• Improve visualization:
  – Provide map with drill down to node information
  – Automate production of long term trend plots for regions
  – More node selection capabilities
• Traceroute measurement and analysis

More
• PingER Project
  – http://www-iepm.slac.stanford.edu/pinger/
  – IEEE Communications Magazine on Network Traffic Measurements and Experiments.
• ICFA/SCIC Network Monitoring report, Jan ‘04
  – http://www.slac.stanford.edu/xorg/icfa/icfa-netpaper-jan04/
• IEPM-BW
  – http://www-iepm.slac.stanford.edu/
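
Sketch: plateau-style event detection

The history/trigger buffer rule on the Anomalous Event Detection slide can be illustrated with a short Python sketch. This is not the IEPM-BW implementation. The function name `detect_step_down`, the warm-up length `min_history`, and the default `d = 0.8` are assumptions made for illustration (the slides give b = 2, a history length of 600 and a trigger length of 60, but no value for d). The slides also do not say what happens to the trigger buffer once it fills; here its samples are moved into the history buffer so that the new level becomes the baseline, which is likewise an assumption.

```python
from collections import deque
from statistics import mean, pstdev


def detect_step_down(samples, b=2.0, d=0.8, history_len=600,
                     trigger_len=60, min_history=10):
    """Return the sample indices at which a step-down event is declared.

    b, history_len and trigger_len follow the values quoted on the slides.
    d and min_history are hypothetical values chosen for this sketch.
    """
    history = deque(maxlen=history_len)  # recent "normal" samples
    trigger = []                         # samples well below the history mean
    events = []

    for i, y in enumerate(samples):
        # Assumed warm-up: fill the history before judging any sample.
        if len(history) < min_history:
            history.append(y)
            continue

        mh = mean(history)     # history mean
        sh = pstdev(history)   # history standard deviation

        if y < mh - b * sh:
            trigger.append(y)  # suspiciously low: goes to the trigger buffer
        else:
            history.append(y)  # normal: goes to the history buffer

        if len(trigger) >= trigger_len:
            mt = mean(trigger)  # trigger mean
            if mt < d * mh:
                events.append(i)
            # Assumption: adopt the new level as the baseline afterwards.
            history.extend(trigger)
            trigger.clear()

    return events


if __name__ == "__main__":
    # Synthetic bandwidth series in Mbits/s: steady, then a sustained drop.
    series = [940.0 if i % 2 == 0 else 960.0 for i in range(100)] + [400.0] * 60
    print(detect_step_down(series, history_len=40, trigger_len=10))
```

Because an event needs a full trigger buffer, a single low sample never raises one; only a sustained drop does, which matches the stated goal of finding long-term step-down changes rather than momentary dips.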