IEPM/PingER Project. Les Cottrell, SLAC. DoE 2004 PI Network Research Meeting, FNAL, Sep 15-17 ‘04. www.slac.stanford.edu/grp/scs/net/talk03/scidac-pinger-sep04.ppt

IEPM/PingER Project
Les Cottrell, SLAC
DoE 2004 PI Network Research Meeting, FNAL Sep 15-17 ‘04
www.slac.stanford.edu/grp/scs/net/talk03/scidac-pinger-sep04.ppt
Outline
• PingER
– Purpose etc.
– Methodology
– Results
• PingER-NG ≡ IEPM-BW
– Low network impact bandwidth tool (INCITE)
– Traceroute viz
– Topology (INCITE)
PingER
• Uses ping to provide lightweight performance monitoring:
– < 100 bits/s per pair measured
– No software to install at remote sites
– Measures loss, RTT, reachability, jitter
• For planning, troubleshooting
• Originally (1990s) for HENP sites
• More recently also to characterize the Digital
Divide
– ICFA/SCIC, Internet2 Hard to Reach Places, WSIS, ICTP/eJDS
Methodology
• Use ubiquitous ping
• Every 30 minutes, from each monitoring site to each target:
– 1 ping to prime caches
– by default send 11 x 100-byte packets followed by 10 x 1000-byte packets
• Low network impact + no software to install / configure /
maintain at remote sites + no passwords / accounts
needed = good for developing sites / regions
• Record loss & RTT, (+ reorders, duplicates)
• Derive throughput, jitter, unreachability …
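The derivation step above can be sketched as follows. This is a minimal illustration, not the PingER code: the function names are hypothetical, jitter is taken here as the mean absolute difference between consecutive RTTs, and throughput is derived with the Mathis et al. bound MSS / (RTT * sqrt(loss)) using an assumed MSS of 1460 bytes. The last function checks the "< 100 bits/s per pair" figure from the default packet counts, counting payload bytes only.

```python
import math
import statistics


def summarize_pings(rtts_ms, sent):
    """Summarize one ping set; rtts_ms holds the RTTs of the replies received."""
    received = len(rtts_ms)
    summary = {"loss": 1.0 - received / sent, "unreachable": received == 0}
    if received:
        summary["min_rtt_ms"] = min(rtts_ms)
        summary["avg_rtt_ms"] = statistics.mean(rtts_ms)
    if received >= 2:
        # Assumed jitter definition: mean absolute change between consecutive RTTs.
        deltas = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
        summary["jitter_ms"] = statistics.mean(deltas)
    return summary


def derived_throughput_kbits(rtt_ms, loss, mss_bytes=1460):
    """Mathis-style bound in kbits/s; undefined (None) when no loss was seen."""
    if loss <= 0 or rtt_ms <= 0:
        return None
    return (mss_bytes * 8 / 1000) / ((rtt_ms / 1000) * math.sqrt(loss))


def probe_rate_bits_per_s(interval_s=1800, small=(11, 100), large=(10, 1000)):
    """Average payload rate of the default probe set for one monitor/remote pair."""
    payload_bytes = small[0] * small[1] + large[0] * large[1]
    return payload_bytes * 8 / interval_s


print(summarize_pings([10.0, 12.0, 11.0], sent=10))
print(derived_throughput_kbits(rtt_ms=100, loss=0.01))
print(probe_rate_bits_per_s())
```

With the defaults, 11,100 payload bytes every 1800 s comes to roughly 49 bits/s, which is consistent with the stated budget even after packet headers are added.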
Architecture
[Diagram: ~35 monitoring sites ping ~550 remote sites and cache the results; archive sites (SLAC, FNAL) collect the data from the monitoring sites over HTTP and publish reports & data on the WWW. Each measurement is for 1 monitor host/remote host pair.]
• Hierarchical vs. full mesh
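To make the hierarchical vs. full mesh contrast concrete, here is a rough count of measured pairs using the approximate host counts from the diagram (~35 monitoring, ~550 remote). The function names are illustrative, and the hierarchical figure is an upper bound that assumes every monitoring host pings every remote host, which the slides do not state.

```python
def hierarchical_pairs(monitors, remotes):
    """Upper bound: every monitoring host pings every remote host."""
    return monitors * remotes


def full_mesh_pairs(hosts):
    """Every host measures every other host (directed pairs)."""
    return hosts * (hosts - 1)


monitors, remotes = 35, 550
print(hierarchical_pairs(monitors, remotes))  # 19250
print(full_mesh_pairs(monitors + remotes))    # 341640
```

Under these assumptions the hierarchy needs well under a tenth of the pairs of a full mesh, and only the monitoring hosts need any software installed.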
Coverage
• In last 9 months added:
– Several sites in Russia (thanks GLORIAD)
– Many hosts in Africa (from 5 to 36, now in 27 out of 54 countries)
– Monitoring sites in Pakistan and Brazil (Sao Paulo and Rio)
• Now monitoring 650 sites in 115 countries
• Working to install monitoring host in Bangalore, India
[Map: monitoring sites and remote sites]
World View
• C. Asia, Russia, S.E. Europe, L. America, M. East, China: 4-5 yrs behind
• India, Africa: 7 yrs behind
• S.E. Europe, Russia: catching up
• Latin Am., Mid East, China: keeping up
• India, Africa: falling behind
• 50% improvement/year ~ factor of 10 in < 6 years
• Important for policy makers
[Figure: TCP throughput measured from N. America to world regions. Y-axis: derived TCP throughput in KBytes/sec (log scale, 1 to 10000); x-axis: Jan-95 to Dec-04. Series: Edu (141), Europe (150), Canada (27), Latin America (37), S.E. Europe (21), Mid East (16), Russia (17), China (13), Caucasus (8), C. Asia (8), Africa (30), India (7). From the PingER project, Aug 2004.]
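The "50% improvement/year ~ factor of 10 in < 6 years" figure follows from compound growth. A short check, assuming a constant annual growth rate (the helper name is illustrative):

```python
import math


def years_to_multiply(factor, annual_growth):
    """Years needed to grow by `factor` at a constant annual growth rate."""
    return math.log(factor) / math.log(1.0 + annual_growth)


print(round(years_to_multiply(10, 0.50), 2))  # 5.68
print(1.5 ** 6)                               # 11.390625
```

So a steady 50% per year reaches a factor of 10 in about 5.7 years, and six full years gives a factor of about 11.4.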
View from CERN
• Confirms view from N. America
[Figure: TCP throughput from CERN to world regions. Y-axis: derived TCP throughput in Kbits/s (log scale, 1 to 100000); x-axis: Feb-98 to Dec-04. Series: Europe, N America, SE Europe, M East, India, L America, Russia, China, Africa. From the PingER project, August 2004.]
From Developing Regions
Novosibirsk
• NSK to Moscow used to be OK, but loss went up in Sep. 2003: big loss increase to Moscow (from < 1% to 2-3%)
• GLORIAD may help
[Figure: TCP throughput from Novosibirsk to world regions. Y-axis: derived throughput in Kbits/s (log scale, 1 to 10000); x-axis: Sep-02 to Aug-04. Series: Africa, Balkans, Europe, N. America, S. America, Australasia, E. Asia, M. East, Russia, S. Asia; annotations mark Moscow and Japan/China.]
Brazil (Sao Paulo)
• As expected, Brazil to L. America is good
– Actually dominated by Brazil to Brazil
• To Chile & Uruguay poor, since traffic goes via US
[Figure: TCP throughput measured from Brazil to world regions. Y-axis: derived TCP throughput in KBytes/s (log scale, 10 to 10000); x-axis: Jan-04 to Aug-04. Series: Latin America, Europe, N. America, Africa, Russia, E. Asia, S. America, S. Asia.]
Technology Achievement Index (TAI)
• TAI captures how well a country is creating and diffusing technology and building a human skills base.
• TAI from UNDP hdr.undp.org/reports/global/2001/en/pdf/techindex.pdf
TAI top 12
Country          TAI
Finland          0.744
US               0.733
Sweden           0.703
Japan            0.698
Korea Rep. of    0.666
Netherlands      0.630
UK               0.606
Canada           0.589
Australia        0.587
Singapore        0.585
Germany          0.583
Norway           0.579
[Figure note: US & Canada off-scale]
PingER-NG = IEPM-BW
• Need measurement tools for high-performance
paths/applications
– BER of 10^-8 takes > 1 day to see 1 loss
– Ping losses ≠ TCP losses
• Build infrastructure to
– Measure with:
• Iperf (TCP mem-to-mem), GridFTP, bbftp
• Lightweight packet pair dispersion
– Evaluate measurement tools
Low impact bandwidth measurement
• Goals:
– Make a measurement in < 1 second rather than tens of seconds
– Inject little network traffic
– Provide reasonable agreement with more intense methods (e.g. iperf)
• Enables:
– Measurements of low performance links (e.g. to developing countries)
– Avoiding the need for scheduling
– More frequent measurements (minutes vs. hours)
– Lower impact, so more network friendly
Low impact Bandwidth
• Use 20 packet pairs to roughly estimate dynamic bandwidth Capacity & Xtraffic (cross-traffic), then Available = Capacity – Xtraffic
– Capacity is estimated from the minimum pair separation; Xtraffic from the packet pair dispersion
[Figure: ABwE, SLAC to Caltech, Mar 19, 2004. Series: dynamic bandwidth capacity (DBC), cross-traffic, available bandwidth (= DBC – X-traffic), and iperf.]
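The packet-pair idea can be sketched in a few lines. This is a simplified estimator under stated assumptions, not the actual ABwE/abing code: capacity comes from the smallest observed pair separation, and cross-traffic is estimated from the extra dispersion beyond that minimum, treated as the fraction of time the bottleneck was busy with other traffic. The function name, the 1500-byte packet size, and the sample values are illustrative.

```python
def estimate_bandwidth(dispersions_s, packet_bytes=1500):
    """Return (capacity, cross_traffic, available) in Mbits/s.

    dispersions_s: receiver-side separations, in seconds, of the two packets
    in each pair (e.g. 20 pairs per measurement).
    """
    if not dispersions_s or min(dispersions_s) <= 0:
        raise ValueError("need positive pair separations")
    bits = packet_bytes * 8
    t_min = min(dispersions_s)
    capacity = bits / t_min / 1e6
    # Assumption: dispersion beyond the minimum is time spent serving cross-traffic.
    busy = sum(t - t_min for t in dispersions_s)
    utilization = busy / sum(dispersions_s)
    cross_traffic = capacity * utilization
    return capacity, cross_traffic, capacity - cross_traffic


# 20 pairs: half undisturbed (12 us apart), half stretched to 18 us by cross-traffic.
pairs = [12e-6] * 10 + [18e-6] * 10
capacity, cross, available = estimate_bandwidth(pairs)
print(round(capacity), round(cross), round(available))
```

Because only 20 small pairs are sent, a measurement like this finishes in well under a second, which is what makes minute-scale sampling practical.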
Achievable throughput & file transfer
• IEPM-BW
– High impact (iperf, bbftp, GridFTP …) measurements at 90 ± 15 min intervals
[Figure: time series of iperf, abing, bbftp, and iperf1 throughput with min and avg RTT; forward and reverse route changes are marked. Callout: select focal area.]
Anomalous Event Detection
• Too many graphs to scan by hand, need to automate
– SLAC to Caltech link performance dropped by a factor of 5 for ~1 month before it was noticed; fixed within 4 hours of reporting
• Looking for long-term step-down changes in bandwidth
• Use modified “plateau” algorithm from NLANR
– Divide data into history & trigger buffers
– If y < mh – b * sh then add y to the trigger buffer, else to the history buffer (mh, sh = mean and standard deviation of the history buffer; b = 2)
• When trigger buffer fills: if mt < d * mh (mt = mean of the trigger buffer), then have an event
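The history/trigger logic can be sketched as a small class. This is a minimal reading of the bullets above, not the IEPM implementation: mh and sh are taken as the mean and standard deviation of the history buffer and mt as the mean of the trigger buffer. The class name, the warm-up rule (`min_history`), the threshold d = 0.8, and the choice to fold the trigger samples into history once the trigger buffer fills are all assumptions. Unlike fuller versions of the plateau algorithm, normal samples do not drain the trigger buffer here.

```python
from collections import deque
from statistics import mean, pstdev


class PlateauDetector:
    def __init__(self, b=2.0, d=0.8, history_len=600, trigger_len=60, min_history=10):
        self.b = b                        # sensitivity, in standard deviations
        self.d = d                        # hypothetical event threshold ratio
        self.trigger_len = trigger_len
        self.min_history = min_history    # assumed warm-up before testing samples
        self.history = deque(maxlen=history_len)
        self.trigger = []

    def add(self, y):
        """Feed one bandwidth sample; return True when a step-down event fires."""
        if len(self.history) < self.min_history:
            self.history.append(y)
            return False
        mh, sh = mean(self.history), pstdev(self.history)
        if y >= mh - self.b * sh:
            self.history.append(y)        # normal sample
            return False
        self.trigger.append(y)            # suspiciously low sample
        if len(self.trigger) < self.trigger_len:
            return False
        mt = mean(self.trigger)
        event = mt < self.d * mh
        # Assumption: the low samples become part of history once evaluated.
        self.history.extend(self.trigger)
        self.trigger = []
        return event


detector = PlateauDetector(history_len=50, trigger_len=5)
samples = [950, 960] * 10 + [750] * 5     # Mbits/s, steady then a sustained drop
print([detector.add(s) for s in samples][-1])
```

Requiring the trigger buffer to fill before declaring an event is what filters out short dips, so only sustained step-downs are reported.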
Route table Example
• Compact so can see many routes at once
– History navigation
– Route # at start of day, gives idea of route stability
– Multiple route changes (due to GEANT), later restored to original route
– Mouseover for hops & RTT
– Available bandwidth
– Raw traceroute logs for debugging
– Textual summary of traceroutes for email to ISP
– Description of route numbers with date last seen
– User readable (web table) routes for this host for this day
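One way such a compact route table could be built is to give each distinct hop sequence a small route number and record when it was last seen. The slides do not say how the IEPM tool assigns route numbers, so the scheme below, along with the function name and sample hop labels, is purely an illustrative assumption.

```python
def number_routes(observations):
    """Assign a route number to each distinct hop sequence.

    observations: (date, hops) tuples in time order, hops being a sequence of
    router names or addresses from one traceroute.
    Returns (numbers, catalog, changes).
    """
    catalog = {}   # hop tuple -> {"number": n, "last_seen": date}
    numbers = []   # route number per observation, in order
    for when, hops in observations:
        key = tuple(hops)
        if key not in catalog:
            catalog[key] = {"number": len(catalog), "last_seen": when}
        catalog[key]["last_seen"] = when
        numbers.append(catalog[key]["number"])
    # A route change is any switch between consecutive observations.
    changes = sum(1 for a, b in zip(numbers, numbers[1:]) if a != b)
    return numbers, catalog, changes


observations = [
    ("2004-06-01", ["a", "b", "c"]),
    ("2004-06-01", ["a", "b", "c"]),
    ("2004-06-02", ["a", "x", "c"]),   # alternate route
    ("2004-06-03", ["a", "b", "c"]),   # restored to the original route
]
numbers, catalog, changes = number_routes(observations)
print(numbers, changes)  # [0, 0, 1, 0] 2
```

A row of small integers per host makes a change followed by a return to the original route visible at a glance.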
Another example
– Get AS information for routes
– Level change
– Host not pingable
– TCP probe type
– ICMP checksum error
– Intermediate router does not respond
Topology
• Choose times and hosts and submit request
– Nodes colored by ISP
– Mouseover shows node names
– Click on node to see subroutes
– Click on end node to see its path back
– Also can get raw traceroutes with AS information
[Figure: topology map by hour of day, from SLAC via ESnet and GEANT to JAnet, IN2P3, CESnet, and CCLRC (DL, RL), with alternate routes marked.]
Putting it together
[Figure: Bandwidth from SLAC to Supernet.org, June 2, 2004. Y-axis: bandwidth in Mbits/s (0 to 1000); x-axis: 6/2/04 0:00 to 6/3/04 0:00. Series: Cap, Abw, Xtr, with mh and mh – 2σh marked, plus route changes. Path diagram: SLAC, ESnet, CENIC, Abilene, SOX, Supernet.]
• mh = 954 Mbits/s, mt = 753 Mbits/s
• (mh – mt) / sqrt((σh^2 + σt^2)/2) = 2.4
• b (sensitivity) = 2; l (history buffer length) = 600; t (trigger buffer length) = 60; d: value not given
New features in works (with NIIT)
• Improve new site set-up tools
• Improve management
– Discover non-working links faster
• Improve access to data and metadata
– Provide database with lat/long, country etc.
– Add web services access
• Improve visualization:
– Provide map with drill down to node information
– Automate production of long term trend plots for regions
– More node selection capabilities
• Traceroute measurement and analysis
More
• PingER Project
– http://www-iepm.slac.stanford.edu/pinger/
– IEEE Communications Magazine on Network Traffic
Measurements and Experiments.
• ICFA/SCIC Network Monitoring report, Jan ‘04
– http://www.slac.stanford.edu/xorg/icfa/icfa-netpaper-jan04/
• IEPM-BW
– http://www-iepm.slac.stanford.edu/