Offense - Northwestern Networks Group


Profiling Network Performance for
Multi-Tier Data Center Applications
Offense by –
Balasaheb Bagul
Rumou Duan
1
Polling Interval – How much is it?
• Errors arise when the interval is either smaller or greater than 500 ms.
2
SNAP Configuration - I
• 8K hosts and just 700 applications!
• In Section 3.2, they collect only discrete socket-call
logs (“99.8% of connections has low throughput less than 1 MB/s”)
– 1 GB of data per host per day and 1 TB per week!
• Continuous TCP logs are completely ignored
– With the polling interval averaging 500 ms
– 120 bytes per connection per poll?
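
As a back-of-the-envelope check on the last two figures, the sketch below estimates how much continuous TCP log data one connection would generate per day. It assumes one 120-byte record per connection per poll and a fixed 500 ms interval; the function name is hypothetical and is not taken from the paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def tcp_log_bytes_per_day(bytes_per_poll, poll_interval_s, connections=1):
    """Bytes of continuous TCP logs per day, assuming one record per connection per poll."""
    polls_per_day = SECONDS_PER_DAY / poll_interval_s
    return bytes_per_poll * polls_per_day * connections

# Figures from the slide: 120 bytes per poll, 500 ms polling interval.
per_connection = tcp_log_bytes_per_day(120, 0.5)
print(f"{per_connection / 1e6:.1f} MB per connection per day")  # 20.7 MB
```

Multiplying by the number of concurrent connections on a host gives a rough per-host figure, which is the quantity the slide argues is missing from the evaluation.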
3
SNAP Configuration - II
• Where is the collected data analyzed:
on the hosts or on a centralized server?
– If centralized: 8000 × 1 GB = 8 TB per day of socket logs alone
– How and when is this data sent to the
central server?
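
To put the centralized case in perspective, here is a rough sketch of the aggregate volume and the sustained bandwidth the central server would have to absorb. It assumes decimal units (1 GB = 10^9 bytes) and uploads spread evenly over the day, neither of which is stated in the paper; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def aggregate_log_rate(hosts, gb_per_host_per_day):
    """Return (total GB per day, sustained Gbit/s) if all hosts ship logs to one server."""
    total_gb = hosts * gb_per_host_per_day
    bytes_per_second = total_gb * 1e9 / SECONDS_PER_DAY  # assumes an even spread over the day
    gbit_per_second = bytes_per_second * 8 / 1e9
    return total_gb, gbit_per_second

# Figures from the slides: 8000 hosts, 1 GB of socket logs per host per day.
total_gb, gbps = aggregate_log_rate(8000, 1)
print(f"{total_gb} GB/day, about {gbps:.2f} Gbit/s sustained")
```

If the logs are uploaded in batches rather than continuously, the peak rate is correspondingly higher, which is why the "how and when" question matters.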
4
SNAP Configuration - III
• Sockets to Processes mapping
– Done when the sockets are open
– Processes can create new sockets and close
old ones dynamically
– So they have to do this mapping in that
short frame of time and continuously.
5
CPU Overhead – I (At each host)
• Polling TCP stats + Reading TCP table = 5%+5% < 10%
• Collecting socket logs: 1.6%. What about the TCP performance classifier?
6
Fine-grained profiling?
TCP Incast Problem
In the paper: “For example, the TCP incast
problem [3], caused by micro bursts of traffic
at the timescale of tens of milliseconds, is not
even visible in SNMP data.”
However, based on Figure 8, the CPU
overhead of such fine-grained polling is really large.
7
CPU Overhead – II (At Server)
• Cross-connection correlation is centralized
• How will it scale? The paper does not say!
• How does it work?
• “SNAP has full knowledge of network
topology, the network-stack configuration,
and mappings of applications to servers.”
8
SNAP Validation
• The test bed includes only 36 hosts!
• Extremely small amount of data collected
• ACC (average correlation coefficient) = 0.4
– Why?
– Are all the connections with ACC just above 0.4
facing problems?
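
To make the ACC question concrete, here is a minimal sketch of one way an average correlation coefficient could be computed: the mean Pearson correlation over all pairs of per-connection time series. This is an assumption for illustration, and the paper's exact definition may differ; `pearson`, `average_correlation`, and the sample data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length series; 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def average_correlation(series):
    """Mean pairwise Pearson correlation across a list of time series."""
    pairs = list(combinations(series, 2))
    if not pairs:
        return 0.0
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Hypothetical per-interval problem indicators for three connections.
connections = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
acc = average_correlation(connections)
print(acc > 0.4)  # compare against the 0.4 threshold from the slide
```

Because the score is an average over pairs, a group can cross the threshold even when some of its connections are uncorrelated or anti-correlated with the rest, which is the concern behind the question above.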
9
Advice to DC Operators – Seriously!
1. “Operators should schedule backup jobs
more carefully to avoid triggering network
congestion”
– 2 am to 4 am is already the most idle time for bulk
transfers! So why change it?
2. “Operators should disable delayed ACK or
reduce it significantly”
– What about time-critical applications?
10
Advice to Developers – Again, Seriously!
• Claim: “Developers can use these logs to quickly
find the root cause of performance problems.”
• Problems that SNAP detected took several
days or even weeks to solve!
– Do developers have weeks to spare?
– So does this mean that SNAP’s data is not that useful to
developers?
• “There should be better scheduling of traffic
across applications…”
– How to do it?
11
Conclusion
• Not scalable due to centralized server
• Huge data collected per host per day
– Continuously
• Get it to work with more applications!
12
Thank you!
13