ICFA/SCIC Network Monitoring
Prepared by Les Cottrell, SLAC, for the
Digital Divide and HEPGrid Workshop
UERJ – Rio de Janeiro, Brazil
Feb 16 – 20, 2004
www.slac.stanford.edu/grp/scs/net/talk03/icfa-feb04.ppt
Partially funded by DOE/MICS Field Work Proposal on
Internet End-to-end Performance Monitoring (IEPM), also
supported by IUPAP
Outline
• What is PingER
• World Internet performance trends
• Regions and Digital Divide
• Examples:
– Africa, Latin America, Brazil, Pakistan
• Compare with Development Index
• Challenges
• Collaborations & Funding
• Futures for PingER
• Summary of state of world Internet performance
Methodology
• Use ubiquitous ping
• Every 30 minutes, from monitoring site to target:
– 1 ping to prime caches
– by default, send 11 x 100-byte packets followed by 10 x 1000-byte packets
• Low network impact, good for the developing world
• Record loss & RTT (+ reorders, duplicates)
• Derive throughput, jitter, unreachability …
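A minimal sketch of the per-burst bookkeeping described above: given the RTTs of one burst of pings, with lost pings marked as `None`, it computes the loss percentage and min/avg/max RTT. The function and field names are hypothetical, the caller is assumed to have already discarded the cache-priming ping, and reorder/duplicate detection is not modeled.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class PingSummary:
    sent: int
    received: int
    loss_pct: float
    rtt_min_ms: Optional[float]
    rtt_avg_ms: Optional[float]
    rtt_max_ms: Optional[float]


def summarize(rtts_ms: Sequence[Optional[float]]) -> PingSummary:
    """Summarize one burst; None marks a ping that got no reply (lost)."""
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    if not replies:
        # Every ping lost (or nothing sent): no RTT statistics are possible.
        return PingSummary(sent, 0, loss_pct, None, None, None)
    return PingSummary(sent, len(replies), loss_pct,
                       min(replies), mean(replies), max(replies))


if __name__ == "__main__":
    # Hypothetical burst of 10 pings: one lost, the rest replied.
    burst = [152.1, 150.3, None, 151.0, 149.8, 150.5, 150.9, 151.2, 150.0, 150.4]
    s = summarize(burst)
    print(s.loss_pct, s.rtt_min_ms, s.rtt_max_ms)
```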
Architecture
[Diagram: archive hosts at SLAC and FNAL gather data over HTTP from monitoring hosts (each with a cache); monitoring hosts ping remote hosts; reports & data are published on the WWW. Basic unit: 1 monitor host – remote host pair]
• Hierarchical vs. full mesh
PingER Benefits
• Aimed at: end users (net-admins & sophisticated users) and planners
• Measures, analyzes & reports round-trip times, losses, availability, throughput ...
– Uses ubiquitous ping: no special host or software to install/configure at remote sites, no passwords or credentials needed
– Low impact on network (<< 100 bits/s), important for many Digital Divide (DD) sites
– Covers 100+ countries (> 90% of Internet-connected population)
• Provides quantitative historical (> 8 yrs) and near real-time information
– Aggregate by regions, affiliations etc.
– How bad is performance to various regions; rank countries?
– Trends: who is catching up or falling behind; is progress being made?
– Compare vs. economic, development indicators etc.
• Use for trouble-shooting, setting expectations, identifying needed upgrades, choosing a provider, presenting to policy makers and funding bodies
[Screen shot: Monitoring site vs. Remote sites]
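A back-of-the-envelope check of the low-impact claim, as a sketch under stated assumptions: it takes the default schedule from the Methodology slide (1 priming ping, then 11 x 100-byte and 10 x 1000-byte pings every 30 minutes), assumes the priming ping is also 100 bytes, and ignores IP/ICMP and link-layer header overheads.

```python
# Back-of-the-envelope PingER probe load, under the assumptions stated above.
PROBE_INTERVAL_S = 30 * 60  # one set of pings every 30 minutes
# (number of pings, bytes per ping); the 100-byte priming ping size is an assumption
DEFAULT_SCHEDULE = [(1, 100), (11, 100), (10, 1000)]


def average_probe_rate_bps(schedule=DEFAULT_SCHEDULE,
                           interval_s: float = PROBE_INTERVAL_S) -> float:
    """Average one-way probe rate in bits/s (headers ignored)."""
    total_bytes = sum(count * size for count, size in schedule)
    return total_bytes * 8 / interval_s


if __name__ == "__main__":
    rate = average_probe_rate_bps()
    # Echo replies carry the same payload back, so the two-way total is 2x.
    print(f"{rate:.1f} bits/s each way, {2 * rate:.1f} bits/s in total")
```

Under these assumptions the probes average about 50 bits/s in each direction (about 100 bits/s counting the echo replies); including packet headers would raise this slightly.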
Countries Monitored
[Maps: monitoring hosts and remote hosts by country]
• Monitoring hosts: 33 hosts in 13 countries
• Remote hosts: 560 sites in > 100 countries (880 hosts, 3700 pairs)
• Need > 1 host to reduce anomalies
Map legend: Used to monitor; Only 1 host
World Trends
• Increase in sites with Good (<1%) loss
• 25% increase in sites monitored
– Big focus on Africa 4=>19 countries
– Silk Road
[Chart: Loss quality ratings seen from SLAC, Apr-00 to Oct-03; y axis: number of sites (0–300); annotations: WSIS, Ping blocking, ICTP, 50%, 60%]
Loss quality categories:
• Good < 1%
• Acceptable >= 1% & < 2.5%
• Poor >= 2.5% & < 5%
• V. poor >= 5% & < 12%
• Dreadful > 12%
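The category boundaries above can be expressed as a small classifier; this is a sketch, not PingER's own code. It assumes loss is given in percent, uses half-open intervals, and treats exactly 12% as Dreadful because the legend leaves that single value unassigned. The tally helper and the site names in the usage lines are hypothetical.

```python
from collections import Counter
from typing import Mapping


def loss_quality(loss_pct: float) -> str:
    """Map a packet-loss percentage onto the loss quality categories."""
    if loss_pct < 1.0:
        return "Good"
    if loss_pct < 2.5:
        return "Acceptable"
    if loss_pct < 5.0:
        return "Poor"
    if loss_pct < 12.0:
        return "V. poor"
    return "Dreadful"  # assumption: exactly 12% is counted as Dreadful


def sites_per_category(loss_by_site: Mapping[str, float]) -> Counter:
    """Count sites per category, e.g. for one month of a 'number of sites' chart."""
    return Counter(loss_quality(loss) for loss in loss_by_site.values())


if __name__ == "__main__":
    # Hypothetical site names and monthly loss values, for illustration only.
    example = {"site-a": 0.3, "site-b": 1.7, "site-c": 0.8, "site-d": 14.0}
    print(sites_per_category(example))
```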
Loss to world from US
[Maps: 2001 and Dec-2003; loss rate legend: < 0.1 to 1 %, 1 to 2.5 %, 2.5 to 5 %, 5 to 12 %, > 12 %]
• In 2001, < 20% of the world's population had Good or Acceptable loss performance
• BUT by December 2003 it had improved to 77%
Trends
S.E. Europe, Russia: catching up
Latin Am., Mid East, China: keeping up
India, Africa: falling behind
Derived throughput ~ MSS / (RTT * sqrt(loss))
Africa is shown only for Uganda seen from SLAC, since adding new countries with very different throughputs distorts the result
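The derived-throughput formula used on these charts is the well-known Mathis et al. approximation for standard TCP. A minimal sketch follows; it assumes loss is passed as a fraction (0.01 = 1%), defaults the MSS to 1460 bytes (the S value used on the NIIT slide), and exposes the constant as `c` (1.0 for the "~" form here; the NIIT slide uses 1.2). Returning infinity for zero loss is a choice made for this sketch, and it also illustrates why derived throughputs become unreliable on very low-loss paths.

```python
import math


def derived_throughput_bps(rtt_s: float, loss_fraction: float,
                           mss_bytes: int = 1460, c: float = 1.0) -> float:
    """Approximate TCP throughput ~ c * MSS / (RTT * sqrt(loss)), in bits/s.

    rtt_s: round-trip time in seconds; loss_fraction: e.g. 0.01 for 1% loss.
    """
    if rtt_s <= 0:
        raise ValueError("RTT must be positive")
    if loss_fraction <= 0:
        # With no observed loss the formula has no finite value.
        return math.inf
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_fraction))


if __name__ == "__main__":
    # 100 ms RTT, 1% loss, default 1460-byte MSS
    print(f"{derived_throughput_bps(0.100, 0.01) / 1e3:.0f} kbits/s")
```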
Regions Monitored
• Recently added NIIT PK as a monitoring site
• White = no host monitored in country
• Colors indicate regions
• Also have affinity groups (VOs), e.g. AMPATH, Silk Road, CMS, XIWT, and can select multiple groups
Digital Divide Regions
• Design regions
– to match well known world regions and
– to have similar connectivity within region
• Developed:
– U.S.+Canada, Japan+Taiwan+Singapore+Korea,
Australia+NZ, Europe (excl. SE Europe, Russia)
• Developing (Digital Divide):
– Africa, S. America, C. America, C. Asia, China, S. Asia,
Caucasus, M. East, SE Europe, Russia
Israel has much better connectivity than its neighbors in the Mid East, so it distorts Mid East results; move it to Europe?!
Greece is part of Europe; should it be part of S.E. Europe? The choice varies with time…
Current State – Aug '03
thruput ~ MSS / (RTT * sqrt(loss))
• Within-region performance is better
– E.g. Ca|EDU|GOV-NA, Hu-SE Eu, Eu-Eu, Jp-E Asia, Au-Au, Ru-Ru|Baltics
• Africa, Caucasus, Central & S. Asia all bad
Bad < 200 kbits/s (< DSL)
Poor > 200, < 500 kbits/s
Acceptable > 500, < 1000 kbits/s
Good > 1000 kbits/s
Region | Countries | #
Africa | Ghana, Namibia, Nigeria, S. Africa, Uganda | 6
C Asia | Kazakhstan, Kyrgyzstan, Mongolia, Tajikistan, Turkmenistan, Uzbekistan | 9
S Asia | Bangladesh, India, Indonesia, Malaysia, Pakistan, Thailand, (Vietnam) | 16
M East | Egypt, Iran, Israel, Jordan, Saudi Arabia, Turkey | 10
Caucasus | Armenia, Azerbaijan, Georgia |
S America | Argentina, Brazil, Colombia, Chile, Peru, Uruguay, Venezuela |
China | China including Hong Kong | 5
Russia | Russia | 5
C America | Costa Rica, Cuba, Guatemala, Mexico | 4
SE Europe | (Albania), Bulgaria, Croatia, Macedonia, Moldova, Romania, Serbia/Montenegro, Slovenia |
[Chart: Variability]
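The throughput bands on this slide can be applied the same way as the loss categories. A minimal sketch, with a hypothetical function name: the slide gives open intervals (> 200, < 500 and so on), so treating exactly 200, 500 and 1000 kbits/s as the lower edge of the next band up is an assumption.

```python
def throughput_rating(kbits_per_s: float) -> str:
    """Rate a derived throughput using the Bad/Poor/Acceptable/Good bands."""
    # Assumption: band edges (200, 500, 1000 kbits/s) belong to the higher band.
    if kbits_per_s < 200:
        return "Bad"
    if kbits_per_s < 500:
        return "Poor"
    if kbits_per_s < 1000:
        return "Acceptable"
    return "Good"


if __name__ == "__main__":
    # ~320 kbits/s is the average Iperf/TCP rate to NIIT quoted later in this talk.
    print(throughput_rating(320))
```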
Examples of Use
• Africa
• L. America
• Brazil
• Pakistan:
– NIIT & NCP
Loss to Africa
Loss to Latin America from US
Loss to Brazil from US
[Chart: Packet loss from SLAC to Brazil, Feb-02 to Apr-04; y axis: ping packet loss (0–16); series: UNESP.BR, CBPF.BR, UNICAMP.BR, INPE.BR, UERJ.BR, UFRJ.BR, Average, Expon. (Average)]
• UERJ, UFRJ, INPE and UNICAMP share the 1st 14 hops to Brazil, via CENIC-Abilene-AMPATH
• CBPF: ESnet-Level(3)-Embratel (Miami-Rio)
To Pakistan performance
[Charts: Loss % and RTT (ms) from SLAC to Karachi, NIIT/Rawalpindi, Islamabad and Lahore]
• NB: big differences between sites
• Karachi routes: ESnet (hops 3-8) – DC, ATT (9-21) – Karachi
• NIIT/Rawalpindi routes: ESnet (hops 3-6) – SNV, SINGTEL (7-12) – Karachi, Pakistan Telecom Karachi – Rawalpindi
• Lahore routes: ESnet (hops 3-6) – SNV, SINGTEL (7-12) – Karachi, Pakistan Telecom Karachi – Lahore
From Pakistan Performance
• NIIT to SLAC
– Route: Pakistan Telecom (hops 2-5) Rawalpindi, Karachi; SingTel (6-10); ESnet (11-14) – PAIX
• NIIT to CERN
– Route: Pakistan Telecom (hops 2-5) Rawalpindi, Karachi; Concert (6-9) London; DataTAG (11-12) .de
• Note the similarities, probably due to a common bottleneck, probably in Pakistan
NIIT performance from U.S. (SLAC)
Preliminary results; measurements started at the end of Dec 2003.
[Chart: Ping RTT & Loss]
• NB: heavy losses during congested day-times
• Avg daily: loss ~1-2%, RTT ~320 ms
Bandwidth measurements using packet pair dispersion & TCP:
• abing (packet pair dispersion) average: to NIIT ~350 kbits/s, from NIIT ~365 kbits/s
• Iperf/TCP (with SLAC) average: to NIIT ~320 kbits/s, from NIIT ~330 kbits/s
• Iperf/TCP (with CERN) average: to NIIT ~270 kbits/s, from NIIT ~300 kbits/s
Can also derive throughput (assuming standard TCP) from RTT & loss (monthly) using:
BW ~ 1.2 * S / (RTT * sqrt(loss)), with S = 1460 B
~ 260 kbits/s (SLAC to NIIT)
~ 630 kbits/s (NIIT to SLAC | CERN)
Nominal path bottleneck capacity 1 Mbits/s
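As a worked check of the formula (plain arithmetic, not additional measurements), plugging in the daily averages quoted above, RTT ~320 ms and loss ~1-2%, gives the range below. Inverting it, the ~260 kbits/s monthly SLAC to NIIT figure would correspond to roughly 2.8% loss at 320 ms; the actual monthly averages are not given here, so this is only a consistency check.

```latex
\[
BW \approx \frac{1.2 \times S}{RTT\,\sqrt{loss}}, \qquad
1.2 \times S = 1.2 \times 1460\ \text{B} \times 8 = 14016\ \text{bits}
\]
\[
RTT = 0.32\ \text{s},\ loss = 0.01:\quad
BW \approx \frac{14016}{0.32 \times 0.1} \approx 438\ \text{kbits/s}
\]
\[
RTT = 0.32\ \text{s},\ loss = 0.02:\quad
BW \approx \frac{14016}{0.32 \times 0.1414} \approx 310\ \text{kbits/s}
\]
\[
BW = 260\ \text{kbits/s},\ RTT = 0.32\ \text{s}:\quad
loss \approx \left(\frac{14016}{0.32 \times 260000}\right)^{2} \approx 0.028
\]
```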
Available Bandwidth
• Use available bandwidth estimator (abing)
– Uses packet pair dispersion
– Low impact: 40 x 1450-byte packets
– Repeated once per minute
– Client at SLAC, mirror/server at NIIT
• Red line is the cross-traffic
– deduced by looking at the variation over and above the minimum packet pair separation
• Green line is the bandwidth capacity of the current bottleneck
– deduced from the minimum packet separation
• Blue line is the available bandwidth = capacity - cross-traffic
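To make the red/green/blue relationship concrete, here is a simplified packet-pair model, not abing's actual algorithm: capacity comes from the minimum pair separation, the extra separation of each pair is attributed to cross-traffic served at that capacity, and available bandwidth is the difference. It assumes a single FIFO bottleneck and receiver-side dispersion measurements; the function name and example dispersions are hypothetical, and only the 1450-byte packet size is taken from the slide.

```python
from statistics import mean
from typing import Sequence, Tuple


def packet_pair_estimate(pkt_bytes: int,
                         dispersions_s: Sequence[float]) -> Tuple[float, float, float]:
    """Return (capacity, cross_traffic, available) in bits/s.

    dispersions_s: receiver-side gaps between the two packets of each pair.
    """
    if not dispersions_s or min(dispersions_s) <= 0:
        raise ValueError("need at least one positive dispersion")
    bits = pkt_bytes * 8
    d_min = min(dispersions_s)
    # Tightest spacing ~ back-to-back service at the bottleneck.
    capacity = bits / d_min
    # Extra spacing beyond d_min is assumed to be cross-traffic served at the
    # bottleneck rate, i.e. a fraction (d - d_min) / d of that pair's interval.
    cross_traffic = mean(capacity * (d - d_min) / d for d in dispersions_s)
    return capacity, cross_traffic, capacity - cross_traffic


if __name__ == "__main__":
    # Hypothetical dispersions for 1450-byte pairs (not abing output).
    cap, cross, avail = packet_pair_estimate(1450, [0.0116, 0.0116, 0.0232, 0.0232])
    print(f"capacity {cap / 1e3:.0f} kbits/s, cross {cross / 1e3:.0f}, available {avail / 1e3:.0f}")
```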
To NCP Pakistan
• Cannot use PingER to measure to ncp.edu.pk
– Pings blocked at FLAG router (62.216.145.154, AS15412) on the way to Comsats (Pakistani ISP)
• Working with NCP to try to resolve this
– Trying to contact FLAG
• Using abing instead
– Reports ~2 Mbits/s, but the link is 384 kbps
– Looking into the discrepancy
Within Pakistan
• SLAC – Karachi U:
– ESnet (hops 3-8) – DC, ATT (9-21) – Karachi
• SLAC – NIIT Rawalpindi:
– ESnet (hops 3-6) – SNV, SINGTEL (7-12) – Karachi, Pakistan Telecom Karachi – Rawalpindi
• SLAC – U Lahore: similar to NIIT
• SLAC – NSC:
– ESnet (hops 1-6), C&W (7-11) Santa Clara – NY, FLAG (12-16) NY – London – Karlsruhe, Comsats
• NIIT – NSC (Rawalpindi – Islamabad), a few miles apart:
– No peering in Pakistan; can this be changed?
– Route goes via England:
• PIE (hops 1-5), Concert (6-9) – London, FLAG (10-14) London – Karachi, Comsats (15)
– Takes longer than to SLAC
Pakistan Conclusions
• Big performance differences between sites, depending on ISP (at least 3 ISPs seen for Pakistan A&R sites)
• To NIIT:
– get about 300 kbps, possibly 380 kbps at best
– Suspect the bottleneck is not 1 Mbps but rather closer to 400 kbps
– The bottleneck appears to be in Pakistan
– There is often congestion (packet loss & extended RTTs) during busy periods each weekday
– Video will probably be sensitive to packet loss, so its quality may depend on the time of day
– H.323 (typically needs 384 kbps + 64 kbps) would appear to be marginal at best at any time
• No peering in Pakistan between NIIT and NSC
Technology Achievement Index (TAI)
• TAI captures how well a country is creating and diffusing technology and building a human skills base.
• TAI from UNDP: hdr.undp.org/reports/global/2001/en/pdf/techindex.pdf
TAI top 12:
Country | TAI
Finland | 0.744
US | 0.733
Sweden | 0.703
Japan | 0.698
Korea, Rep. of | 0.666
Netherlands | 0.630
UK | 0.606
Canada | 0.589
Australia | 0.587
Singapore | 0.585
Germany | 0.583
Norway | 0.579
[Chart annotation: US & Canada off-scale]
Challenges
• Effort:
– Negligible for remote hosts
– Monitoring host: < 1 day to install and configure, plus occasional updates to remote host tables and problem response
– Archive host: 20% FTE; code stable, could do with an upgrade; contact monitoring sites whose data is inaccessible
– Analysis: your decision; for long-term details, usually download & use Excel
– Trouble-shooting:
• usually reactive: user reports, then look at PingER data
• have experimented with automating alerts; data is/will be available via web services
• Ping blocking
– Complete blocks are easy to identify; then contact the site to try to by-pass them, which can be frustrating for the 3rd world
– Partial blocks are trickier; compare with synack
• Derived throughputs are poor estimates for well-connected sites (< 0.1% loss)
• Funding
– "Unfortunately, network management research has historically been very underfunded, because it is difficult to get funding bodies to recognize this as legitimate networking research." Sally Floyd, IAB Concerns & Recommendations Regarding Internet Research & Evolution.
– http://www.ietf.org/internet-drafts/draft-iab-research-funding-00.txt
Collaborations & Funding
• 35+ monitoring sites in 13 countries
– Plan to add ICTP Trieste if funded
– Other projects used the toolkit, e.g. XIWT, PPCNG/EDG, IAEA …
• SLAC with help from FNAL
• Digital Divide collaboration (MOU) with ICTP, Trieste
– eJDS
– We are looking for an IDRC grant for eJDS and PingER
• Need funding for the coming year (DoE funding ended):
– Tasks:
• (0.5 FTE) ongoing: maintain data collection, explain needs, reopen connections, open firewall blocks, find replacement hosts, make limited special analyses, prepare & make presentations, respond to questions
• (+ 0.5 FTE) extend the code for the new environment (more countries, more data collections), fix known non-critical bugs, improve visualization, automate some of the reports generated by hand today, find new country site contacts, add route histories and visualization, automate alarms, detect rate limiting earlier, update the web site for better navigation, add more DD monitoring sites/countries, improve code portability, understand regions better
• Also looking for small grants for helpers in developing countries
• ICFA: show importance to policy makers and funding agencies, identify sympathetic contacts at agencies, get support
Futures
• Get >= 2 hosts/country
• Better/quicker detection of rate limiting
• Have 4 students at GATech rewriting parts of PingER to improve day-to-day management (reduce the effort required) and improve portability
• Submitting a proposal to IDRC for monitoring Africa and adding a measurement host in Nigeria
• Need better automated tools to produce graphs like those in this presentation
Summary
• Performance from U.S. & Europe is improving all over
• Performance to developed countries is orders of magnitude better than to developing countries
• Poorer regions are 5-10 years behind
• Poorest regions: Africa, Caucasus, Central & S. Asia
• Some regions are:
– catching up (SE Europe, Russia),
– keeping up (Latin America, Mid East, China),
– falling further behind (e.g. India, Africa)
More Information
• PingER:
– www-iepm.slac.stanford.edu/pinger/
• MonaLisa
– monalisa.cacr.caltech.edu/
• GGF/NMWG
– www-didc.lbl.gov/NMWG/
• ICFA/SCIC Network Monitoring report, Jan 03
– www.slac.stanford.edu/xorg/icfa/icfa-net-paper-dec02
• Monitoring the Digital Divide, CHEP03 paper
– arxiv.org/ftp/physics/papers/0305/0305016.pdf
• Human Development Index
– www.undp.org/hdr2003/pdf/hdr03_backmatter_2.pdf
• Network Readiness Index
– www.weforum.org/site/homepublic.nsf/Content/Initiatives+subhome
Extra Slides
Visualization
Keep it simple; enable users to do their own by making data available
• Tables
– Time series (www-iepm.slac.stanford.edu/cgiwrap/pingtable.pl):
• select metric (loss, RTT etc.), time ticks, packet size, aggregations from/to, etc.
• color code numbers, provide sort, drill down to graphs, download data (TSV), statistical summaries
– Monitoring site vs. Remote sites (www-iepm.slac.stanford.edu/cgi-wrap/table.pl):
• select metric, region aggregations
• drill down to time series, download data
• Graphs
– Select source(s)/destination(s), metric, time window, SQL selects, graph type
Publish information
• www.slac.stanford.edu/cgiwrap/pingtable.pl => tabular reports, also download data
• Data accessible from MonaLisa
• Implementing web services access prototype
– Includes: PingER, IEPM-BE, RIPE-tt, I2 E2Epi OWAMP
– Use GGF/NMWG schema/profile, e.g.
• path.delay.roundTrip
#!/usr/bin/perl
use strict;
use warnings;
use SOAP::Lite;
# Build a client from the service's WSDL, then request one-way delay
# for a RIPE test-traffic path ("source:destination").
my $characteristic = SOAP::Lite
    ->service('http://www-iepm.slac.stanford.edu/tools/soap/wsdl/profile_06.wsdl')
    ->pathDelayOneWay("tt81.ripe.net:tt28.ripe.net");
# Print the measuring tool's name and the delay statistic returned.
print $characteristic->{NetworkTestTool}->{toolName}, "\n";
print $characteristic->{NetworkPathDelayStatistics}->{value}, "\n";
Recent additions
• Added hosts in Albania, Macedonia, Serbia/Montenegro, Belarus, Turkey, Armenia, Mexico, Cuba, Azerbaijan, Tajikistan, Turkmenistan, Kyrgyzstan, Philippines & Namibia
• Contacts
– Working with contacts in Vietnam and Tunisia
– Looking for contacts in Kenya, Algeria, Malaysia, Thailand, Indonesia and Uganda
– Working with Iran site to set up a monitor host
• Increased hosts monitored from CERN to give a better European view
– Now monitoring 60 countries
Usage Examples: Upgrades & ping filtering
[Chart: Median Packet Loss Seen From nbi.dk, 11/1/98 to 1/24/99; y axis: % 100 Byte Packet Loss During Day (0–50); series: To North America, To Western Europe; annotations: "Ten-155 became operational on December 11."; "Smurf Filters installed on NORDUnet's US connection."]
Peering problems
[Chart: Packet Loss between DESY and FNAL in February and March 2000; y axis: Daily Packet Loss (%) (0–12); x axis: Day of the Month; annotations: "DFN closes Perryman POP and loses direct peering with ESnet"; "Peering re-established via Dante at 60 Hudson"]
Usage Examples
Identify need to upgrade and effects
• BW increase by factor 300
• Multiple sites track
• Xmas & summer holiday
• Selecting ISPs for DSL/Cable services for home users
– Monitor accessibility of routers etc. from site
– Long term and changes
• Trouble shooting
– Identifying that a reported problem is probably network related
– Identify when it started and whether it is still happening or fixed
– Look for patterns:
• Step functions
• Periodic behavior, e.g. due to congestion
• Multiple sites with simultaneous problems, e.g. common problem link/router …
– Provide quantitative information to ISPs
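Step functions like the ones listed under trouble shooting can be flagged automatically. The sketch below is one simple approach (a two-window mean comparison), not PingER's alerting code; the function name, the window length and the 5-percentage-point minimum jump are all assumptions to tune per path.

```python
from statistics import mean
from typing import Optional, Sequence


def find_loss_step(loss_pct: Sequence[float], window: int = 5,
                   min_jump: float = 5.0) -> Optional[int]:
    """Return the index where the largest upward step in loss begins, or None.

    At each candidate index i, the mean of the `window` samples starting at i
    is compared with the mean of the `window` samples before i; the step must
    be at least `min_jump` percentage points to count.
    """
    best_idx: Optional[int] = None
    best_jump: Optional[float] = None
    for i in range(window, len(loss_pct) - window + 1):
        jump = mean(loss_pct[i:i + window]) - mean(loss_pct[i - window:i])
        if jump >= min_jump and (best_jump is None or jump > best_jump):
            best_idx, best_jump = i, jump
    return best_idx


if __name__ == "__main__":
    # Hypothetical daily loss (%): < 1% for 10 days, then ~25%.
    series = [0.5] * 10 + [25.0] * 10
    print(find_loss_step(series))
```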
Rate Limiting
• At any given time, about 5% of monitored hosts are doing this, most in developing countries
• Recently (August 2003) seen an increase in ping rate limiting
[Charts: Loss and RTT for gollum.nask.pl, www.pol34.pl and boromir.nask.waw.pl]
• 2 hosts at the same site see a sudden step-like increase in loss, from < 1% to 20-30%, at a similar time
• Another host in Poland sees no problems, i.e. it helps to have another nearby host
• Similar effects for Greek (uoa.gr), Bulgarian (acad.bg), Kazakhstan (president.kz), Moldovan (asm.md) and Turkish (metud.edu.tr) sites
• If there is no step function or nearby host, it may go unnoticed, so also compare synack vs. ping
• Can ping routers along the path to see where the onset occurs
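The synack comparison mentioned above can be reduced to a simple rule of thumb: if ping loss is high while TCP SYN/ACK probes to the same host see almost none, ICMP is probably being rate limited rather than the path being congested. Both thresholds below are illustrative assumptions, and the function name is hypothetical.

```python
def looks_ping_rate_limited(ping_loss_pct: float, synack_loss_pct: float,
                            min_ping_loss: float = 5.0,
                            max_synack_loss: float = 1.0) -> bool:
    """Heuristic: heavy ICMP loss while TCP SYN/ACK probes get through.

    If both kinds of probe lose packets, the path itself is more likely
    congested or broken; if only pings are lost, ICMP is probably being
    rate limited somewhere along the path.
    """
    return ping_loss_pct >= min_ping_loss and synack_loss_pct <= max_synack_loss


if __name__ == "__main__":
    print(looks_ping_rate_limited(25.0, 0.2))   # ping-only loss
    print(looks_ping_rate_limited(25.0, 20.0))  # both lossy: congestion?
```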
Rate Limiting Moldova
[Charts: RTT and Loss for lises.asm.md and cni.md (Moldova), and for Bulgaria]
Russia
• Russian losses improved by a factor of 5 in the last 2 years, due to multiple upgrades
• E.g. upgrade of the KEK-BINP link from 128 kbps to 512 kbps, May '02: improved from a few % loss to ~0.1% loss
Africa
• Hosts in: Ile-Ife/Nigeria, Accra/Ghana, Kampala/Uganda, Windhoek/Namibia, UCT/ZA, Johannesburg/ZA, Musselbay/ZA
• Carriers:
– GH uses UUNET/Satworks, NA uses UUNET/xantic, NG uses TELIANET/NewSkies, UG uses Level(3)/globalconnex
– ZA varies from site to site: UUNET/ALTERNET, C&W Telecom S. Africa, CAIS telcom S. Africa
• UG, NA, NG, GH use satellites (> 600 ms)
• ZA uses landlines
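Why the satellite-connected sites show RTTs above 600 ms can be checked with a little physics. Assuming geostationary (GEO) satellites, which the slide does not state explicitly but is typical for such links, each ping crosses the ground-to-satellite distance four times (up and down for the request, up and down for the reply) when one end of the path is on the satellite link. The sketch computes that propagation-only lower bound; the function name is hypothetical.

```python
GEO_ALTITUDE_KM = 35_786            # geostationary altitude above the equator
SPEED_OF_LIGHT_KM_S = 299_792.458   # in vacuum


def min_geo_ping_rtt_ms(satellite_legs: int = 4) -> float:
    """Lower bound on ping RTT from the satellite legs alone, in ms.

    Assumes the shortest slant range (ground station at the sub-satellite
    point); terrestrial segments and queuing are not included.
    """
    return satellite_legs * GEO_ALTITUDE_KM / SPEED_OF_LIGHT_KM_S * 1000


if __name__ == "__main__":
    print(f"{min_geo_ping_rtt_ms():.0f} ms")
```

This gives about 477 ms before any terrestrial, slant-range or queuing delay is added, so RTTs above 600 ms are what one would expect on such paths.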
Africa RTT
• Monitored from N. America & Europe
– Depends on remote site (not monitoring site)
– Satellite for all except S. Africa
– Ghana problems
West Africa
• Ghana: very poor performance
– Sudden increase on August 18th
– Not rate limiting, according to synack
– Sometimes gets down to a few %
– Route: ESnet-UUNET/ALTER.NET
– Losses appear on the last 2 hops in Ghana
• Nigeria better
– Route via TELIANET/newskies
Africa Derived Throughput
• S. Africa (UCT) best, followed by Uganda, Nigeria and Ghana
• Throughput to Nigeria site ≈ home DSL/cable
• Throughput to Ghana site ≈ modem dial-up
[Chart: NREN Core Network Size (Mbps-km, 1K–10M) vs. GDP per capita (PPP US$) and Human Development Index (HDI) rank, 2000 and 2001, for Europe and the United States; categories: Advanced, Leading, In transition, Lagging. Countries: Albania, Moldova, Ukraine, Georgia, Romania, Estonia, Lithuania, Iran (Islamic Rep.), Slovenia, Turkey, Belgium, Greece, Iceland, Poland, Ireland, Croatia, Italy, Denmark, Slovakia, Switzerland, Spain, Austria, Hungary, Finland, Netherlands, Czech, United States]
Source: from slide prepared by Harvey Newman, presented by David Williams, ICFA/SCIC talk on Serenate report. Data from the TERENA Compendium; GDP per capita and HDI source: UN.
Loss Comparisons with Development (UNDP)
[Chart: Derived throughput ~ MSS / (RTT * sqrt(loss))]
• Weak correlation with Human Development or GDP
• Even weaker with education & literacy
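The slide does not say which correlation measure was used. As an illustration only, a Pearson coefficient is one common choice; because derived throughputs span orders of magnitude, correlating an index against the logarithm of throughput is often more informative. The function below is a hypothetical helper, not part of PingER.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Usage pattern (hypothetical inputs): an index vs. log10 of throughput.
    index_values = [0.2, 0.4, 0.6, 0.8]
    throughputs_kbps = [100.0, 300.0, 900.0, 2500.0]
    print(pearson_r(index_values, [math.log10(t) for t in throughputs_kbps]))
```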
Digital Access Index
• DAI (from ITU 2002) includes:
– Availability of infrastructure
– Affordability of access
– Education level
– Quality of ICT services
– Internet usage
[Chart: Top DAI countries]
• Good positive correlation between throughput and DAI
• Care needed with shorter RTTs
Network Readiness
• NRI from Center for International Development, Harvard U.
http://www.cid.harvard.edu/cr/pdf/gitrr2002_ch02.pdf
• Using derived throughput ~ MSS / (RTT * sqrt(loss))
– Fit to exponential is better
[Chart annotations: Internet for all focus; A&R focus]
NRI Top 14:
Country | NRI
Finland | 5.92
US | 5.79
Singapore | 5.74
Sweden | 5.58
Iceland | 5.51
Canada | 5.44
UK | 5.35
Denmark | 5.33
Taiwan | 5.31
Germany | 5.29
Netherlands | 5.28
Israel | 5.22
Switzerland | 5.18
Korea | 5.10
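The note that a fit to an exponential works better can be reproduced with a log-linear least-squares fit: fit ln(throughput) = ln(a) + b x index, then exponentiate. This is a sketch under that assumption; the slide does not say how its fit was computed, the function name is hypothetical, and a least-squares fit on ln(y) weights relative rather than absolute errors.

```python
import math
from typing import Sequence, Tuple


def fit_exponential(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * exp(b * x) by ordinary least squares on ln(y); return (a, b)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    if any(y <= 0 for y in ys):
        raise ValueError("all y values must be positive to take logarithms")
    logs = [math.log(y) for y in ys]
    mean_x, mean_ly = sum(xs) / n, sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    # Slope and intercept of the straight-line fit in (x, ln y) space.
    b = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, logs)) / sxx
    a = math.exp(mean_ly - b * mean_x)
    return a, b


if __name__ == "__main__":
    # Synthetic check: points generated exactly from y = 2 * exp(0.5 * x).
    xs = [1.0, 2.0, 3.0, 4.0]
    a, b = fit_exponential(xs, [2.0 * math.exp(0.5 * x) for x in xs])
    print(f"a = {a:.3f}, b = {b:.3f}")
```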