Evaluating NGI performance
Matt Mathis
[email protected]
Evaluating NGI Performance
• How well is the NGI being used?
• Where can we do better?
Outline
• Why is this such a hard problem?
– Architectural reasons
– Scale
• A systematic approach
TCP/IP Layering
• The good news:
– TCP/IP hides the details of the network from
users and applications
– This is largely responsible for the explosive
growth of the Internet
TCP/IP Layering
• The bad news:
– All bugs and inefficiencies are hidden from
users, applications and network administrators
– The only legal symptoms for any problem
anywhere are connection failures or less than
expected performance
Six performance problems
• IP Path
– Packet routing, round trip time
– Packet reordering
– Packet losses, congestion, lame hardware
• Host or end-system
– MSS negotiation, MTU discovery
– TCP sender or receiver buffer space
– Inefficient applications
Layering obscures problems
• Consider: trying to fix the weakest link of
an invisible chain
• Typical users, system and network
administrators routinely fail to “tune” their
own systems
• In the future, WEB100 will help…
NGI Measurement Challenges
• The NGI is so large and complex that you
cannot observe all of it directly.
• We want to assess both network and end-system problems
– The problems mask each other
– The users & admins can’t even diagnose their
own problems
The Strategy
• Decouple paths from end-systems
– Test some paths using well-understood end-systems
– Collect packet traces and algorithmically
characterize performance problems
Performance is minimum of:
• TCP bulk transport (path limitation):
Rate ≤ (MSS / RTT) × (C / √p),  with C ≈ 0.7
• Sender or receiver TCP buffer space:
Rate ≤ (MSS / RTT) × Size
• Application, CPU or other I/O limit
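As a rough illustration (not from the original talk), the sketch below evaluates the first two limits for assumed example values of MSS, RTT, loss rate p, and buffer size, and reports which of the three limits binds; the application limit is just a supplied number.

```python
import math

C = 0.7  # constant from the bulk-transport model above

def path_limit_bps(mss_bytes, rtt_s, p):
    """TCP bulk-transport (path) limit: Rate <= (MSS/RTT) * (C/sqrt(p))."""
    return (mss_bytes * 8 / rtt_s) * (C / math.sqrt(p))

def buffer_limit_bps(mss_bytes, rtt_s, size_segments):
    """Sender/receiver buffer limit: Rate <= (MSS/RTT) * Size."""
    return (mss_bytes * 8 / rtt_s) * size_segments

# Hypothetical example values: 1460-byte MSS, 70 ms RTT,
# one loss per 100,000 packets, a 16-segment socket buffer,
# and an application that can only source 50 Mbit/s.
mss, rtt, p = 1460, 0.070, 1e-5
limits = {
    "path":   path_limit_bps(mss, rtt, p),
    "buffer": buffer_limit_bps(mss, rtt, size_segments=16),
    "app":    50e6,
}
for name, bps in limits.items():
    print(f"{name:>6}: {bps / 1e6:6.1f} Mbit/s")
print("bottleneck:", min(limits, key=limits.get))
```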
Packet trace instrumentation
• Independent measures of model:
– Data rate, MSS, RTT and p
– Measure independent distributions for each
• Detect end-system limitations
– Whenever the model does not fit
Rate ≤ (MSS / RTT) × (C / √p)
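A minimal sketch of that fit test, assuming the per-connection rate, MSS, RTT, and loss rate p have already been extracted from a packet trace; the 50% tolerance is an arbitrary illustrative threshold, not a value from the talk.

```python
import math

C = 0.7  # constant from the bulk-transport model

def model_rate_bps(mss_bytes, rtt_s, p):
    """Rate predicted by the path model: (MSS/RTT) * (C/sqrt(p))."""
    return (mss_bytes * 8 / rtt_s) * (C / math.sqrt(p))

def classify_flow(measured_bps, mss_bytes, rtt_s, p, tolerance=0.5):
    """If the measured rate falls well short of what the path model
    allows, something other than the path (buffers, application,
    CPU) is the likely bottleneck."""
    predicted = model_rate_bps(mss_bytes, rtt_s, p)
    if measured_bps >= tolerance * predicted:
        return "path limited (model fits)"
    return "end-system limited (model does not fit)"

# Hypothetical trace summary: 3 Mbit/s achieved over a path that
# the model says should support roughly 37 Mbit/s.
print(classify_flow(measured_bps=3e6, mss_bytes=1460, rtt_s=0.070, p=1e-5))
```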
The Experiments
• Actively test a (small) collection of paths with
carefully tuned systems
• Passively trace and diagnose all traffic at a small
number of points to observe large collections of
paths and end systems.
• [Wanted] Passively observe flow statistics for
many NGI paths to take a complete census of all
end systems capable of high data rates.
Active Path Testing
• Use uniform test systems
– Mostly Hans Werner Braun’s AMP systems
– Well tuned systems and application
– Known TCP properties
• Star topology from PSC for initial tests
– Evolve to multi-star and sparse mesh
• Use passive instrumentation
Typical (Active) Data
• 83 paths measured
• For the moment assume:
– All host problems have been eliminated
– All bottlenecks are due to the path
• Use traces to measure path properties
– Rate, MSS, and RTT
– Estimate window sizes and loss interval
• Sample has target selection bias
Data Rate
[Figure: CDF of per-path data rate, 0–80 Mbit/s]
Data Rate Observations
• Only one path performed well
– (74 Mbit/s)
• About 15% of the paths beat 100 MB in 30 s
– (27 Mbit/s)
• About half of the paths were below old
Ethernet rates
– (10 Mbit/s)
Round Trip Times
[Figure: CDF of round-trip time, 0–200 ms]
RTT Observations
• About 25% of the RTTs are too high
(PSC to San Diego is ~70 ms)
– Many reflect routing problems
– At least a few are queuing (traffic) related
Loss Interval (1/p)
[Figure: CDF of loss interval, 10 to 10^6 packets between losses]
Loss Interval Observations
• Only a few paths do very well
– Some low-loss paths have high delay
• Only paths with fewer than 10 losses per million packets are OK
• Finding packet losses at this level can be
difficult
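To see roughly where a "losses per million" budget like this comes from, the bulk-transport model can be inverted: p ≤ (MSS · C / (RTT · Rate))². The sketch below does this for a few target rates, assuming a 1460-byte MSS and the ~70 ms PSC-to-San-Diego RTT mentioned earlier; these inputs are illustrative, not the measured data.

```python
C = 0.7  # constant from the bulk-transport model

def required_loss_rate(target_bps, mss_bytes=1460, rtt_s=0.070):
    """Invert Rate <= (MSS/RTT) * (C/sqrt(p)) to get the largest
    loss rate p that still allows the target rate."""
    return (mss_bytes * 8 * C / (rtt_s * target_bps)) ** 2

for mbps in (10, 27, 74):
    p = required_loss_rate(mbps * 1e6)
    print(f"{mbps:3d} Mbit/s needs p <= {p:.1e} "
          f"(at most one loss per {1 / p:,.0f} packets)")
```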
Passive trace diagnosis
• Trace Analysis and Automatic Diagnosis
(TAAD)
• Passively observe user traffic to measure the
network
• These are very early results
Example Passive Data
• Traffic is through the Pittsburgh GigaPoP
• Collected with MCI/NLANR/CAIDA OC3-mon and CoralReef software
• This data set is mostly commodity traffic
• Future data sets will be self-weighted NGI samples
Observed and Predicted Window
• Window can be observed by looking at TCP
retransmissions
• Window can be predicted from the observed
interval between losses
• If they agree, the flow is path limited
– The bulk performance model fits the data
• If they don’t, the flow is end-system limited
– Observed window is probably due to buffer limits but
may be due to other bottlenecks
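A minimal sketch of that comparison, assuming the trace analysis has already produced an observed window (from retransmission behaviour) and a loss interval for each flow; the factor-of-two "agreement" test and the example flow are illustrative placeholders, not the actual TAAD rule.

```python
import math

C = 0.7  # constant from the bulk-transport model

def predicted_window_bytes(mss_bytes, loss_interval_pkts):
    """Window the path model predicts from the observed interval
    between losses: W = MSS * C * sqrt(loss interval)."""
    return mss_bytes * C * math.sqrt(loss_interval_pkts)

def classify(observed_window_bytes, mss_bytes, loss_interval_pkts, factor=2.0):
    """Path limited if observed and predicted windows roughly agree;
    otherwise assume a buffer or other end-system limit."""
    predicted = predicted_window_bytes(mss_bytes, loss_interval_pkts)
    if predicted <= factor * observed_window_bytes:
        return "path limited"
    return "end-system limited"

# Hypothetical flow: 1460-byte MSS, one loss every 300 packets,
# observed window pinned near an 8 kByte default socket buffer.
print(classify(observed_window_bytes=8192, mss_bytes=1460,
               loss_interval_pkts=300))
```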
Window Sizes
[Figure: population histogram of observed vs. predicted window sizes, 0–50 kBytes]
Window Sizes
[Figure: CDF of observed vs. predicted window sizes, 0–50 kBytes]
Observations
• 60% of the commodity flows are path limited, with window sizes smaller than 5 kBytes
• A huge discontinuity at 8 kBytes reflects common default buffer limits
• About 15% of the flows are affected by this
limit
Need NGI host census
• Populations of end systems that have reached significant performance plateaus
• Have solved “all” performance problems
• Confirm other distributions
• Best collected within the network itself
Conclusion
• TCP/IP layering confounds diagnosis
– Especially with multiple problems
• Many pervasive network and host problems
– Multiple problems seem to be the norm
• Better diagnosis requires better visibility
– Ergo WEB100