
An update on HNTES
M. Veeraraghavan
University of Virginia (UVA)
[email protected]
Chris Tracy
ESnet
[email protected]
Feb. 24, 2014
Thanks to the US DOE ASCR for grants DE-SC0002350 and DE-SC0007341 (UVA), and for DE-AC02-05CH11231 (ESnet)
Thanks to Brian Tierney, Chin Guok, Eric Pouyoul
UVA Students: Zhenzhen Yan, Tian Jin, Zhengyang Liu,
Hanke (Casey) Meng, Ranjana Addanki, Haoyu Chen, and Sam Elliott
1
Outline
• Three main contributions
– HNTES
– AFCS
– QoS provisioning
• Goal: Operationalize AFCS on ESnet5
• Future work: feedback?
HNTES: Hybrid Network Traffic Engineering System
AFCS: Alpha Flow Characterization System or EFCS
2
Contributions
• HNTES: Tested the hypothesis that if IP address
prefixes extracted from offline analysis of
completed alpha flows are used to redirect future
alpha flows to traffic-engineered MPLS LSPs, the
solution will be effective
• AFCS: Characterize alpha flows (size, duration)
• QoS provisioning: Requested support for rate-unspecified circuits; policing can throttle throughput
– Two new classes added in new ESnet QoS document
• Best-Effort Circuit Class (different from Best-Effort Class)
• Assured Forwarding Class
3
Publications
Published
• Z. Yan, M. Veeraraghavan, C. Tracy, C. Guok, “On how to provision
Quality of Service (QoS) for large dataset transfers,” CTRQ 2013,
Best Paper Award
• T. Jin, C. Tracy, M. Veeraraghavan, Z. Yan, “Traffic Engineering of
High-Rate Large-Sized Flows,” IEEE HPSR 2013
• Z. Liu, M. Veeraraghavan, Z. Yan, C. Tracy, J. Tie, I. Foster, J.
Dennis, J. Hick, Y. Li and W. Yang, “On using virtual circuits for
GridFTP transfers,” IEEE SC2012, Nov. 10-16, 2012
• Z. Yan, C. Tracy, M. Veeraraghavan, “A hybrid network traffic
engineering system,” IEEE HPSR 2012, June 24-27, 2012
Submitted
• Two journal papers and one conference paper
4
HNTES vs. AFCS
• Goal of HNTES was to identify IP
addresses of data transfer nodes that
were sourcing/sinking alpha flows
– Analyzes only single NetFlow records (one
generated per minute per flow)
• Goal of AFCS: characterize the size, rate
and duration of alpha flows
– Requires concatenation of multiple NetFlow
records to characterize individual flows
– Not aggregation as done by commercial tools
5
AFCS
• AFCS work is newer: current focus
• Easier to operationalize than HNTES
– HNTES requires an additional step to
redirect alpha flows to the AF class
through firewall filter configuration
– Needs new work for ALU routers; previous
QoS experiments were on Junipers
• Goal: Characterize alpha flows
– Determine size (bytes), duration, rate
6
AFCS Algorithm
• Find NetFlow records for all gamma flows
– gamma flow is defined to be a flow that has at
least one “Large” NetFlow record
– Large NetFlow record: size > threshold (1 GB)
– Maximum duration of a NetFlow record is 1 min
because of “active timeout interval” value
configured in ESnet routers
• Start concatenation procedure to reconstruct
“flows” out of “records”
• Use size/rate thresholds to find alpha flows
– e.g., 10 GB and 200 Mbps
7
Step 1: Finding NetFlow
records of gamma flows
• Find all Large Netflow records
• Extract five-tuple IDs of these Large
records
– srcIP, dstIP, srcport, dstport, protocol
• Find all Small NetFlow records
corresponding to those five-tuple IDs
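
A minimal Python sketch of this step, under assumptions: the NetFlow records have already been exported (e.g., via nfdump) and parsed into dicts, and the field names (src_ip, dst_ip, src_port, dst_port, proto, bytes) and the 1 GB threshold are illustrative rather than the actual AFCS schema or code.

# Sketch of Step 1: gather all NetFlow records belonging to gamma flows.
# Field names are hypothetical; adapt to the real NetFlow/nfdump output.
LARGE_BYTES = 1_000_000_000  # "Large" record threshold: 1 GB

def five_tuple(rec):
    # Flow ID: srcIP, dstIP, srcport, dstport, protocol
    return (rec["src_ip"], rec["dst_ip"],
            rec["src_port"], rec["dst_port"], rec["proto"])

def gamma_records(records):
    # records: list of dicts, one per NetFlow record.
    # Pass 1: five-tuple IDs that have at least one Large record.
    gamma_ids = {five_tuple(r) for r in records if r["bytes"] > LARGE_BYTES}
    # Pass 2: collect all records (Large and Small) for those IDs.
    by_flow = {}
    for r in records:
        fid = five_tuple(r)
        if fid in gamma_ids:
            by_flow.setdefault(fid, []).append(r)
    return by_flow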
8
Step 2: Concatenation procedure
(using example)
[Figure: example NetFlow records of one flow ID, with
inter-record time differences of 180 ms, 889.798 sec,
and 40665 sec]
• All records (reports) observed on same day
• Time gap between last-pkt TS of one record and
first-pkt TS of next record < 1 min for grouping
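
A sketch of this grouping rule, assuming each record carries first- and last-packet timestamps (here first_ts/last_ts, in seconds); these names are placeholders, not the actual code.

# Sketch of Step 2: concatenate one five-tuple's records into flows.
MAX_GAP_S = 60  # gaps shorter than the 1-min active timeout mean "same flow"

def concatenate(recs, max_gap_s=MAX_GAP_S):
    # recs: the NetFlow records of one five-tuple, all from the same day.
    # Returns a list of flows; each flow is a time-ordered list of records.
    flows = []
    for rec in sorted(recs, key=lambda r: r["first_ts"]):
        if flows and rec["first_ts"] - flows[-1][-1]["last_ts"] < max_gap_s:
            flows[-1].append(rec)  # gap < 1 min: extend the current flow
        else:
            flows.append([rec])    # larger gap: start a new flow
    return flows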
9
Step 3: find alpha flows
• Total size of each gamma flow
– Sum of sizes of concatenated NetFlow records
and multiply by 1000
– Packet sampling rate: 1-in-1000
• Total duration of each gamma flow
– Last packet timestamp of last NetFlow record
minus first packet TS of first NetFlow record
in the group
• Rate: size/duration
• Alpha flows: gamma flows whose size and
rate exceed preset thresholds
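
Continuing the sketch under the same assumed fields: size is the sampled byte count scaled by 1000, duration is the last-packet TS of the last record minus the first-packet TS of the first record, and the example 10 GB / 200 Mbps thresholds select alpha flows.

# Sketch of Step 3: size, duration, and rate of a concatenated flow.
SAMPLING = 1000                  # 1-in-1000 packet sampling
ALPHA_SIZE_BYTES = 10 * 10**9    # e.g., 10 GB
ALPHA_RATE_BPS = 200 * 10**6     # e.g., 200 Mbps

def summarize(flow):
    # flow: time-ordered list of records from the concatenation step.
    size = sum(r["bytes"] for r in flow) * SAMPLING       # scale sampled bytes
    duration = flow[-1]["last_ts"] - flow[0]["first_ts"]  # seconds
    rate = 8 * size / duration if duration > 0 else 0.0   # bits/sec
    return {"bytes": size, "duration_s": duration, "rate_bps": rate}

def is_alpha(s):
    return s["bytes"] >= ALPHA_SIZE_BYTES and s["rate_bps"] >= ALPHA_RATE_BPS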
10
Validated algorithm
• Because of NetFlow packet sampling rate, we needed to
validate our size/duration computation algorithm
• Found GridFTP logs from NERSC data transfer node
• Found corresponding NetFlow records from ESnet router
• Found additional NetFlow records with same flow IDs
• Applied algorithm to find size/duration of flows from
NetFlow records
• Recreated “sessions” from GridFTP transfer logs (-fast
option: multiple files transferred on one TCP connection);
found session size and compared with flow size determined
from NetFlow records
• Accuracy close to 100% but decreases with size
• Size accuracy ratio > 100% for smaller sizes
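
The comparison itself can be sketched as a size-accuracy ratio per matched flow ID; the dictionary shapes below are assumptions for illustration, not the validation scripts that were used.

# Sketch: compare NetFlow-derived sizes against GridFTP session sizes.
def accuracy_ratios(netflow_sizes, gridftp_sizes):
    # netflow_sizes: {flow_id: bytes estimated from sampled NetFlow records}
    # gridftp_sizes: {flow_id: bytes reported by the GridFTP transfer log}
    # A ratio near 1.0 means close to 100% accuracy; > 1.0 is over-estimation.
    ratios = {}
    for fid, actual in gridftp_sizes.items():
        if fid in netflow_sizes and actual > 0:
            ratios[fid] = netflow_sizes[fid] / actual
    return ratios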
11
NetFlow observation points (OP)
(data obtained from ESnet4: May-Nov. 2011)
router-1, router-2: BNL and NERSC PE
router-3: sunn-cr1 (REN peerings)
router-4: eqx-sj (commercial peerings)
12
Characterization of gamma flows
(May-Nov. 2011 data)

                              Provider edge routers     Core routers
                              (downloads)               (uploads to DOE labs)
                              router-1    router-2      router-3          router-4
                              (bnl)       (nersc)       (sunn-cr1,        (eqx-sj, commercial
                                                        REN peerings)     peerings)
# gamma flows                 28685       27963         2516              212
# unique gamma-flow
  src-dst pairs               1479        1611          193               158
max size (gamma flow)         633.3 GB    811.6 GB      233.6 GB          112.8 GB
max rate (gamma flow)         5.1 Gbps    5.7 Gbps      0.97 Gbps         0.78 Gbps
longest gamma flow            9 hr        8.8 hr        3.87 hr           2.77 hr
Size (MB) of gamma flows
(May-Nov. 2011 data)

            Provider edge routers      Core routers (uploads to DOE labs)
            (downloads)
            router-1     router-2      router-3          router-4
                                       (REN peerings)    (Commercial peerings)
Min         1001         1001          1005              1010
1st Qu.     1149         1540          4050              1203
Median      1275         2869          4360              1532
Mean        2513         9046          17540             3612
3rd Qu.     1701         8768          21380             3772
Max         633300       811600        233600            112800
IQR         552          7227          17330             2569
CV          5.20         2.56          1.4               2.43
skewness    25.35        12.56         2.37              10.09

(callouts: max sizes of 811 GB at router-2 and 112 GB at router-4)
14
Duration (s) of gamma flows
(May-Nov. 2011 data)

            Provider edge routers      Core routers (uploads to DOE labs)
            (downloads)
            router-1     router-2      router-3          router-4
                                       (REN peerings)    (Commercial peerings)
Min         4.212        8.044         9.55              12.03
1st Qu.     41.85        60.94         190.9             54.97
Median      54.17        121.1         272               94.28
Mean        122.8        414.2         1098              235.6
3rd Qu.     73.58        398.9         1169              227.6
Max         32460        31910         13940             9978
IQR         31.73        338.01        977.94            172.67
CV          7.392        2.34          1.50              3.18
skewness    23.767       10.33         2.32              10.99

(callouts: longest durations of ~9 hours at router-1 and ~2.8 hours at router-4)
Note: the mean is above the median under right (positive) skew.
15
Rate (Mbps) of gamma flows
(May-Nov. 2011 data)

            Provider edge routers      Core routers (uploads to DOE labs)
            (downloads)
            router-1     router-2      router-3          router-4
                                       (REN peerings)    (Commercial peerings)
Min         11.7         3.6           34.6              49.2
1st Qu.     160.9        147           117.6             130.9
Median      199.3        181.9         132.6             156.4
Mean        245.2        230.9         159               182.7
3rd Qu.     258.9        252.1         159.2             195.8
99%         881          944           503               649
Max         5154         5757          979               776
CV          0.71         0.72          0.56              0.61
skewness    7.36         3.95          3.82              2.86
16
Characterization of alpha flows
(May-Nov. 2011 data)
• Results: # of alpha flows over 214 days
  (sensitivity to size-rate threshold)

  size     rate        Router-1   Router-2   Router-3   Router-4
  10 GB    100 Mbps    526        5460       726        3
  10 GB    150 Mbps    399        4121       297        1
  10 GB    180 Mbps    375        3037       124        0
  10 GB    200 Mbps    357        2443       92         0
  50 GB    200 Mbps    19         505        28         0
  80 GB    500 Mbps    0          20         0          0
Persistency measure
(May-Nov. 2011 data)
• CDF of number of gamma and alpha flows per src/dst
pair (router-1 plot close to router-2 plot and hence
omitted)
[Figure: CDFs of per-pair flow counts for gamma flows
and for alpha flows (> 5 GB and 100 Mbps)]
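
A small sketch of the underlying per-pair counts (field names assumed): tally flows per src-dst address pair and turn the counts into an empirical CDF.

# Sketch: flows per src-dst pair, as input to a persistency CDF.
from collections import Counter

def flows_per_pair(flows):
    # flows: iterable of dicts with (assumed) keys src_ip and dst_ip,
    # one entry per gamma (or alpha) flow.
    return Counter((f["src_ip"], f["dst_ip"]) for f in flows)

def empirical_cdf(counts):
    # Sorted (x, F(x)) points over the per-pair flow counts.
    xs = sorted(counts.values())
    return [(x, (i + 1) / len(xs)) for i, x in enumerate(xs)]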
Discussion
• Largest-sized flow rate: 301 Mbps, fastest-flow
size: 7.14 GB, and longest-flow size: 370 GB
• At the low end, one 1.9 GB flow lasted 4181 sec
• High skewness in size for downloads
• Larger-sized flows for downloads than uploads,
and more frequent
• Max numbers of gamma and alpha flows per src-dst pair
were (2913, 1596) for router-2 (nersc)
• The amount of data analyzed is a small subset of
our total dataset, in both the time span and the
number of routers analyzed. Concatenating flows is
a somewhat intensive task, so we tried to choose
routers that would be representative.
19
Potential application
• Find src-dst pairs that are experiencing
high variance in throughput to initiate
diagnostics and improve user experience
– In the 2913 gamma-flow set between the same src-dst
pair, 75% of the flows experienced less than
161.2 Mbps, while the highest rate experienced
was 1.1 Gbps (size: 3.5 GB).
– In the 1596 alpha-flow set, 75% of the flows
experienced less than 167 Mbps, while the
highest rate experienced was 536 Mbps (size:
11 GB).
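
One way to sketch this diagnostic: group per-flow rates by src-dst pair and flag pairs whose maximum rate is far above the 75th-percentile rate; the thresholds and field names here are illustrative only.

# Sketch: flag src-dst pairs with highly variable flow throughput.
import statistics

def high_variance_pairs(flows, min_flows=30, spread_factor=4.0):
    # flows: iterable of dicts with (assumed) keys src_ip, dst_ip, rate_bps.
    rates = {}
    for f in flows:
        rates.setdefault((f["src_ip"], f["dst_ip"]), []).append(f["rate_bps"])
    flagged = []
    for pair, rs in rates.items():
        if len(rs) < min_flows:
            continue                              # too few flows to judge
        q75 = statistics.quantiles(rs, n=4)[2]    # 75th-percentile rate
        if max(rs) > spread_factor * q75:
            flagged.append((pair, q75, max(rs)))  # candidate for diagnostics
    return flagged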
20
Other applications
• Identify suboptimal paths
– science flows should typically enter
ESnet via REN peerings, but some of the
observed alpha flows at eqx-sj could have
occurred because of suboptimal BGP
configurations
– correlate AFCS findings with BGP data
• HNTES: traffic engineering alpha
flows
21
Ongoing work
• ESnet4 upgrade to ESnet5 (2012)
– Juniper to ALU routers
– NetFlow v5 to NetFlow v9
– Flow-tools to nfdump
• Rewrote AFCS code
• Running on an ESnet VM
– Crypto-PAn IP address anonymization
• Demo: D3.js GUI (preliminary)
22
Numbers for Oct. 1-Nov. 12, 2013 data
from bnl-mr2 (24255 gamma flows)

            Size (MB)    Duration (sec)   Rate (Mbps)
Min         1000         8                15
1st Qu.     1992         19.2             227
Median      2300         22.25            651
Mean        4147         80.18            1217
3rd Qu.     4839         65               938.8
90%         6400         259              2626
99%         23396        381              9249
99.9%       40721        1904             10129
Max         313600       36190            10670 (size over-estimate)
IQR         2847.25      45.8             711.8
CV          1.44         5.15             1.48
skewness    17.19        61               2.87

Notes: size and rate increased relative to the 2011 data;
the max size and the max rate are not from the same flow.
23
Feedback?
• Goal: Integrate operational AFCS output
with my.es.net
• Current plan
– Run software every night, compute numbers for
gamma flows observed that day
– Pre-calculate last-24-hours, last-7-days, and
last-30-days JSON files for quick visualization
– Per-site alpha flows (configurable thresholds)
– Store gamma-flow information in SQL database
for easier querying of other types of requests
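
A rough sketch of the nightly pre-calculation: roll up the stored gamma-flow summaries into 24-hour, 7-day, and 30-day windows and write one JSON file per window for the GUI; the file names and summary schema are assumptions, not the planned my.es.net integration.

# Sketch: nightly job writing per-window JSON summaries for visualization.
import json, time

WINDOWS = {"24h": 1, "7d": 7, "30d": 30}  # window name -> days

def write_window_summaries(flows, out_dir="."):
    # flows: list of dicts with (assumed) keys end_ts (epoch secs), bytes, rate_bps.
    now = time.time()
    for name, days in WINDOWS.items():
        recent = [f for f in flows if f["end_ts"] >= now - days * 86400]
        summary = {"window": name,
                   "num_flows": len(recent),
                   "total_bytes": sum(f["bytes"] for f in recent)}
        with open(f"{out_dir}/gamma_flows_{name}.json", "w") as fp:
            json.dump(summary, fp)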
24