NMS requirements/recommendations

Belgrade, October 21, 2009
Vidar Faltinsen, UNINETT

- This talk reflects lessons learned over 19 years of NMS development at UNINETT, NTNU (the Norwegian University of Science and Technology) and other universities around Norway
- Lessons learned from a number of committed people always aiming to improve network operations

Our context

- The network is complex
  - A lot of equipment
  - Heaps of traffic around the clock
- No system is perfect
  - Errors will occur – incidents will hit us
- Motto: be proactive and ahead
  - The user should not call you – you should be the first to know!
- Keep in mind: If information is good… (posted at the right time, kept up to date)… …the user is (more) patient!

Avoid a monolithic NMS

- Not an absolute rule, but be a sceptic
  - If the system is too massive it tends to set the agenda.
  - You should shape the system, not the other way around.
- If too many resources must be invested into understanding the system…
  - …then even more resources must be put into accommodating the system to your needs
- The NMS has no intrinsic value…
  - …it should be a useful tool for you
- But remember nothing is for free – you must in any case invest in understanding what your tools actually do

Not one tool - a set of tools

- Special purpose tools with limited scope are good
  - Example of tool categories:
    - inventory systems
    - trouble ticket systems
    - status monitors
    - measurements (and threshold monitors)
    - server/services focused
    - netflow analysis
    - security-focused
    - configuration tools
    - simulation
- Tools should (ideally) not overlap
- Have a well defined single authority as source for your data sets, i.e.:
  - the set of equipment (with attributes) we manage is defined in one place
  - similarly for our locations (with attributes), etc.
- Autodetection is good
  - But in a controlled environment (be aware of weak SNMPv2 security)

Avoid complexity

- A given tool should manage your whole domain
- Avoid a hierarchy of managers if possible
  - SNMP polls can be done in parallel
  - Bandwidth is not a bottleneck
- Throw "iron" (CPU, memory, disk I/O, battery backed disk controller) at NMS utilization problems
  - If necessary segregate database on a separate system, possibly also webfront
- …but consider redundancy (more later)
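
The point that SNMP polls can run in parallel from a single manager can be illustrated with a minimal sketch. It assumes a hypothetical poll_device() function standing in for a real SNMP request (here it only sleeps to mimic network latency); the device names and the worker count are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def poll_device(host):
    """Hypothetical stand-in for one SNMP GET; sleeps to mimic network latency."""
    time.sleep(0.05)
    return host, {"sysUpTime": 12345}  # made-up reply


def poll_all(hosts, workers=20):
    """Poll every host from one manager, overlapping the per-device wait times."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(poll_device, hosts))


hosts = [f"sw-{i}" for i in range(40)]  # made-up device names
start = time.monotonic()
results = poll_all(hosts)
elapsed = time.monotonic() - start
print(f"polled {len(results)} devices in {elapsed:.2f}s")
```

Polled one after another, the 40 simulated devices would take about two seconds; with 20 workers the waits overlap, which is why a single manager can often cover the whole domain without a hierarchy.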

Place your monitor strategically

- A monitor placed in the periphery of your network is more likely to be cut off
  - place in a central (network wise) location
  - redundant network access (VRRP, HSRP…)
- Redundant power, incl. redundant source of power (UPS/ideally standby generator)
- Monitor the monitor!
- Use SMS for alarms in addition to email
  - Place the SMS sending device physically connected to the NMS

Classify your alarms

- Think through: What are the most vital alarms? What is less important?
- Make sure the most vital alarms actually reach you!
  - and not drown in 10,000 other alarms…
  - or stay saturated in an overworked NMS…
- Red and green lamps are good
  - in large environments in a hierarchical display

Use a single event/alarm system

- The set of tools/monitors you use should all report to one event/alarm system
  - i.e. using SNMP traps or email or…
- The central event/alarm system should scale
  - coping with many events
  - make priorities / sort out important alarms
- Correlate events – but be realistic
  - Detect "in shadow" scenarios
  - Classify stateful alarms in pairs (down/up)
  - Suppress flapping alarms (line going up, down, up, down…)
  - Use hysteresis for threshold alarms. Set high and low thresholds.
  - Again: keep robustness. Rather one alarm too many than missing an important one
- Allow a flexible setup for alarm profiles
  - every person tends to have his own preferences… (but have a company policy)
  - alarms at night/weekend vs daytime
  - important alarms vs less important alarms
  - within vs outside the person's scope of duty/responsibility
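
Hysteresis with a high and a low threshold, reported as a start/end pair, could look roughly like the sketch below. The class name, the threshold values and the sample series are all assumptions made for illustration, not taken from any particular NMS.

```python
class HysteresisAlarm:
    """Threshold alarm that starts at `high` and only ends again at `low`."""

    def __init__(self, high, low):
        if low >= high:
            raise ValueError("low threshold must be below high threshold")
        self.high = high
        self.low = low
        self.active = False

    def update(self, value):
        """Feed one sample; return 'start', 'end' or None (no state change)."""
        if not self.active and value >= self.high:
            self.active = True
            return "start"
        if self.active and value <= self.low:
            self.active = False
            return "end"
        return None


alarm = HysteresisAlarm(high=90, low=70)  # e.g. link utilisation in percent
events = []
for value in [50, 91, 89, 92, 88, 69, 95]:  # made-up samples
    event = alarm.update(value)
    if event is not None:
        events.append((value, event))
print(events)  # [(91, 'start'), (69, 'end'), (95, 'start')]
```

With a single threshold at 90 the samples 91, 89, 92, 88 would have produced four events; the gap between the two thresholds turns them into one stateful alarm.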

Redundant NMS

- Single point of failure is never good
- Complete redundancy is not realistic
  - Too expensive
  - Complexity may bite you
- Three possible ways to go:
  1. Monitor the monitor. Have a spare machine. Have backup. 24x7 guard on duty. Replace ASAP.
  2. Do continuous live replication of the NMS machine to a hot spare.
     - Manually (with few steps) set the hot spare in operation (inherit the NMS IP address)
  3. Use anycast combined with live replication
     - Secondary NMS automatically takes over when primary NMS dies

Without numbers you are nothing

- When an incident occurs – do you have enough data to investigate – and actually pinpoint the cause?
- Disk is cheap
- Collect heaps of statistical data
- Have a scheme for compressing data as time goes (RRD/Stager method)
- Focus on good search tools, reports and visualisation methods to make traffic/statistical anomalies easy to detect
  - Isolation and classification of an error tends to consume most of the recovery time
- Autodetection of thresholds and more complex anomaly detection is even better
  - Remember to moderate the total flow of alarms (classify alarms)
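
The idea of compressing data as it ages can be sketched as a simple consolidation step: several fine-grained samples are folded into one coarser value. This is only an illustration of the principle, not how RRD or Stager are implemented; the function name and the sample values are made up.

```python
def consolidate(samples, factor, fn=None):
    """Fold each run of `factor` consecutive samples into one value.

    By default the run is averaged; pass fn=max to keep the peaks instead.
    """
    if fn is None:
        fn = lambda chunk: sum(chunk) / len(chunk)
    return [fn(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# Twelve made-up fine-grained samples folded into two coarse ones.
fine = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
print(consolidate(fine, 6))       # [35.0, 95.0]
print(consolidate(fine, 6, max))  # [60, 120]
```

Keeping both an average and a maximum series is a common design choice, since averaging alone hides the short peaks that often matter when investigating an incident.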

Logs are gold, scripts as well

- Log, log, log
  - Syslog is also a management system
- Small (shell) scripts can be gold
  - A good idea can be only a few code lines away…
  - A culture that motivates creativity and allows continuous implementation of new scripts/add-ons will step by step improve the overall management process!
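
As an example of how few lines a useful script can be, the sketch below counts syslog messages per host to show which device is the noisiest. It assumes the classic "Mon DD HH:MM:SS host message" line layout; the sample lines are made up.

```python
from collections import Counter


def messages_per_host(lines):
    """Count syslog lines per host, assuming 'Mon DD HH:MM:SS host ...' layout."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 4:  # skip lines too short to carry a host field
            counts[fields[3]] += 1
    return counts


sample = [  # made-up log lines
    "Oct 21 10:00:01 mtfs-272-sw link down on port 1",
    "Oct 21 10:00:03 mtfs-272-sw link up on port 1",
    "Oct 21 10:00:07 core-gw login from console",
]
counts = messages_per_host(sample)
print(counts.most_common(1))  # [('mtfs-272-sw', 2)]
```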

Commit to open source

   

- Open source development works
- Sharing ideas and running code widely improves the quality
- Distributed contributions can speed up implementation
- (Poorly documented) single person projects will eventually die

Adopt good naming standards

- Do not underestimate the value of sound names for your equipment, rooms and locations
- The name of the device should in itself give an idea of what the device is (does) and where it is placed
  - Example: mtfs-272-sw (a switch in area "mtfs", wiring closet "272")
- Also use a thought-through naming standard for router interfaces and switch ports
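
A consistent naming standard also pays off because tools can read the names. The sketch below parses names of the form area-closet-type, as in the mtfs-272-sw example. The exact pattern is an assumption, and only the "sw" suffix comes from the example; the "gw" entry in the type table is hypothetical.

```python
import re

# Assumed pattern: <area>-<wiring closet>-<device type>, lower case.
NAME_RE = re.compile(r"^(?P<area>[a-z0-9]+)-(?P<closet>[a-z0-9]+)-(?P<kind>[a-z]+)$")

# "sw" is taken from the example above; "gw" is a hypothetical addition.
KINDS = {"sw": "switch", "gw": "router"}


def parse_device_name(name):
    """Split a device name into its parts, or return None if it does not fit."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["kind"] = KINDS.get(parts["kind"], parts["kind"])
    return parts


print(parse_device_name("mtfs-272-sw"))
# {'area': 'mtfs', 'closet': '272', 'kind': 'switch'}
```

A check like this can run against the inventory to flag devices whose names do not follow the standard.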

NMS Security

- Restrict access to NMS to authorized crew only
  - both network access and physical access
- Isolate management IP address of switches and base stations to dedicated subnets
- Firmly restrict SNMP access to the network equipment – only from the NMS(es).
  - remember SNMP v2 security is weak
- Be even more restrictive if you allow/use SNMP Write
  - consider SNMP v3 or Netconf

MIB requirements

- Your network equipment should support:
  - RFC 3418: SNMPv2-MIB (system)
  - RFC 2863: IF-MIB (interfaces, incl. 64 bit counters)
  - RFC 4293: IP-MIB (IP-interfaces and ARP; IPv4 and IPv6)
  - RFC 4133: ENTITY MIB (modules, optics, software, serial numbers)
    - Not supported by Juniper
  - RFC 4188: BRIDGE-MIB (bridge table)
  - RFC 4363: Q-BRIDGE MIB (bridge table per vlan, vlan config)
    - Not supported by Cisco
  - RFC 3635: Etherlike-MIB (duplex)
  - RFC 2668: MAU-MIB (medium)
    - equipment support seems scarce (HP has support)
- Your NMS should whenever possible use standard/IETF MIBs rather than vendor proprietary MIBs

Key points – in summary

- Be proactive
- Detect important alarms early
- Inform the users
- Log, log, log (snmp collect)
- Use a number of tools
- Adopt good naming standards
- Value the engineer – small scripts are gold
- Educate your crew! (in both NMS operations and procedures)
