Enterprise Performance Management - Chasing
the Yellow Lights Before They Become Red
An Overview of Performance Monitoring
Speaker - Joseph Convery
Associate Fellow – Verizon Wireless
Speaker Bio
Joseph Convery, 25+ Years in IT
Current
- Associate Fellow – Verizon Wireless
- Adjunct Instructor – Molloy College (Rockville Centre, Long Island)
Technical History
• Associate Fellow – Verizon Wireless, Orangeburg, NY ▪ March 2013 – Present
• Principal Member of Technical Staff – Verizon Wireless, Orangeburg, NY ▪ March 2004 – 2013
• Manager – Verizon Wireless, Walnut Creek, CA ▪ October 2001 – February 2003
• Senior Member of Technical Staff – Verizon Wireless, Walnut Creek, CA ▪ February 1997 –
• Senior Consultant – Coopers and Lybrand, San Francisco, CA ▪ 1994 – 1997
• Senior Program Analyst – Merck Medco, Montvale, NJ ▪ 1991 – 1994
• Mainframe Programmer – Citicorp, Melville, NY ▪ 1988 – 1991
This Presentation
• It is not a new whiz-bang, gigantic theoretical process or methodology framework
that would cause the enterprise to change direction.
• It will compare Performance vs. Fault Management and try to show the distinctions
between the two.
• It will show examples (from bad to great) of how performance management
does or does not exist in many companies.
• It will provide items to consider and ways to use the data being collected.
• It will possibly shed some light on things that you may not have considered.
Fault Management
In network management, fault management is the set of functions that detect, isolate, and correct
malfunctions in a telecommunications network, compensate for environmental changes, and
include maintaining and examining error logs, accepting and acting on error detection
notifications, tracing and identifying faults, carrying out sequences of diagnostics tests, correcting
faults, reporting error conditions, and localizing and tracing faults by examining and manipulating
database information.
When a fault or event occurs, a network component will often send a notification to the network operator using a
protocol such as SNMP. An alarm is a persistent indication of a fault that clears only when the triggering condition
has been resolved. [1]
By its own definition, fault management happens when the company or business is already losing money.
Yes, fault management is incredibly important, but so is preventing the fault from ever happening.
Reference
1 - http://en.wikipedia.org/wiki/Fault_management
NMS in Some Companies
Companies try to reduce MTTR by fixing outages quicker
Companies try to reduce MTTR by preventing outages from happening
[Diagram comparing the OLD WAY OF DOING THINGS and the NEWER WAY OF DOING THINGS. Layer labels: Fault Management, Performance Management, Capacity Planning and Data Analytics]
Fault vs. Performance
[Diagram labels: Performance Level Events; Fault Level Events; "Business impact is being felt."; "Leading up to full outage"; Performance Management; Fault Management; Network Planning; Capacity Planning; Performance Tuning]
What Is Performance Management
and Where Does It Fit Within NMS
(Network Management Systems)
•It is a constantly evolving process
•Endeavors to keep all application
and network services performing at
peak operating efficiency
•Will generate actionable alerts and
alarms to proactively identify areas of
concern.
Origins of Performance Management
• Network performance management is a core component of the FCAPS
ISO telecommunications framework (the 'P' stands for Performance in this
acronym). It enables network engineers to proactively prepare for
degradations in their IT infrastructure and ultimately helps the end-user
experience. [1]
F • Fault
C • Configuration
A • Accounting or Administration
P • Performance
S • Security
Reference
1 - http://en.wikipedia.org/wiki/Network_performance_management
MTTR vs. MTTI
MTTR – Mean Time to Repair – Companies spend a lot of time trying to decrease this.
MTTI – Mean Time to Identify – Companies should spend just as much time on improving this.
[Timeline: Fault detected → Outage escalation → Outage bridge engaged and troubleshooting started → Root cause identified and mitigation efforts started → Outage closed and BAU]
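As a rough illustration of the two measures, the sketch below computes MTTI and MTTR from incident timestamps. The `Incident` record and its field names are hypothetical, and the sketch assumes MTTI runs from fault detection to root cause identification and MTTR from detection to outage closure; many organizations draw these intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative only.
    detected: datetime    # fault detected
    identified: datetime  # root cause identified
    closed: datetime      # outage closed, back to BAU


def mean_delta(deltas):
    """Average a sequence of timedelta values."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mtti(incidents):
    # Assumption: "identify" means root cause identified.
    return mean_delta(i.identified - i.detected for i in incidents)


def mttr(incidents):
    # Assumption: "repair" ends when the outage is closed.
    return mean_delta(i.closed - i.detected for i in incidents)


incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 40),
             datetime(2024, 1, 1, 11, 0)),
    Incident(datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 20),
             datetime(2024, 1, 2, 9, 30)),
]
print("MTTI:", mtti(incidents))
print("MTTR:", mttr(incidents))
```

Tracking both numbers side by side makes it visible when most of the repair time is actually being spent on identification.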
MTTI Scenario
[Timeline: Performance issue detected → React to performance related issues and mitigate the potential outage → No outage. Length of potential issue: customer experiences degraded performance]
MTTI Revisited
•The real world is a lot more complicated.
•Issues appear at any layer in an application
flow
•GSS, load balancers, VM Environments add
to the complicated dynamic monitoring
environment.
•Application performance between each tier is
critical for proactive performance monitoring
Reference http://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/Data_Center/DC_Infra2_5/DCInfra_2.html
Some Silos That Exist
Some version of performance management should exist in every
silo for the enterprise to be adequately covered. It gives
the different operations organizations the proactive alarming they need
to head off outages before they start.
Infrastructure
• CPU
• Memory
• NIC Throughput
• Storage Health
• etc
Application
• JVM Performance
• Thread count
• Sockets
• Transaction throughput
• Transaction Errors
Network
• Latency
• Errors
• Discards
• Utilization
• Top Talkers
• Re-Transmissions
• CPU
• Memory
• Load Balancer Concurrent Connection Rate
• QOS Stats
Database
• Slow database response time
• Database load issues
• Unpredictable performance spikes
• Locking problems
• Internal database contention
Mainframe
• LPAR Status
• CICS Region Health
• IMS Health
• DB2 Health
Focusing on one Silo – (Network)
• Where to start
– User-defined SNMP polling: Operations / users should provide the metrics
of concern and their thresholds.
– All polling should have a threshold (e.g. warning, major, critical)
associated with it.
– Alarms and events driven off the alarms will be actionable
• Operational Actions
• Capacity Planning Actions
• etc (based on the functional organization layout)
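The idea that every polled metric carries warning, major, and critical thresholds can be sketched as a small lookup. The metric name and threshold values below are hypothetical; real values should come from the operations teams or users who own the metric.

```python
# Hypothetical thresholds (percent); owners of each metric should supply real ones.
THRESHOLDS = {
    "cpu_util": {"warning": 70.0, "major": 85.0, "critical": 95.0},
}

# Check the most severe level first so the highest matching severity wins.
SEVERITY_ORDER = ("critical", "major", "warning")


def classify(metric, value, thresholds=THRESHOLDS):
    """Return the severity a polled value maps to, or None if below all thresholds."""
    levels = thresholds[metric]
    for severity in SEVERITY_ORDER:
        if value >= levels[severity]:
            return severity
    return None


for sample in (50.0, 72.0, 85.0, 99.0):
    print(sample, classify("cpu_util", sample))
```

A metric with no entry in the table raises `KeyError`, which is deliberate here: polling without a threshold is exactly the case the slide warns against.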
Performance Metrics
These are some performance metrics that should be considered for
collection for performance monitoring
SNMP – Polling for performance metrics (CPU, memory, utilization, connection rate, QOS, etc.)
Traps/Syslogs – Usually generated for fault level events, but may be needed for metrics that are not available via SNMP
Netflow – Collects traffic summarization stats for correlation
App Monitoring – Collects TCP based performance stats (syn, syn-ack, ack, fin, etc.)
Synthetic testing – Collects transaction response time, TCP connect time, DNS lookup, etc.
Customer Experience Monitoring – How many errors are being experienced by customers (400, 500, etc.)
Custom Scripting – For data that cannot be generated or collected by the methods above, custom scripts may be needed to collect
the required metrics
SNMP
• SNMP stands for Simple Network Management Protocol. It is an industry
standard way of monitoring hardware and software, and is supported by
nearly all manufacturers, from Juniper, to Cisco, to Microsoft, Unix, and
everything in between.
• SNMP requires two basic components to work: management station(s)
and an agent (device).
• SNMP can be centralized or distributed
– Distributed SNMP should be considered because it can provide latency
metrics closer to what the user experiences
– Centralized is easier to administer and maintain, but runs the risk of
allowing the location to be a single point of failure.
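Utilization is one of the most commonly polled SNMP metrics, and it is derived rather than read directly. The sketch below shows the usual arithmetic on two samples of an interface octet counter (for example IF-MIB `ifInOctets`) against the interface speed (`ifSpeed`). It assumes at most one counter wrap between polls and leaves the actual SNMP collection to whatever poller is in use; the function names are illustrative.

```python
def counter_delta(prev, curr, bits=32):
    """Difference between two counter samples, allowing for a single wrap."""
    if curr >= prev:
        return curr - prev
    return curr + (1 << bits) - prev


def utilization_pct(prev_octets, curr_octets, interval_s, if_speed_bps, bits=32):
    """Percent utilization of a link over one poll interval."""
    delta_bits = counter_delta(prev_octets, curr_octets, bits) * 8
    return 100.0 * delta_bits / (interval_s * if_speed_bps)


# 37,500,000 octets in a 300 second poll on a 100 Mb/s interface.
print(utilization_pct(1_000_000, 38_500_000, 300, 100_000_000))
```

On fast interfaces a 32-bit counter can wrap more than once within a poll period, which is why the 64-bit high-capacity counters (`bits=64` here) are generally preferred where the device supports them.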
SNMP Requirements
Things that you should have or consider doing going forward:
- Keep a complete inventory of MIB OIDs and thresholds
- Define user groups or owners for those MIB OIDs
- As IOS or OS versions are upgraded, new OIDs may become available
and may replace custom scripting (e.g. CoPP policy drops on
older versions of Nexus OS)
- NMS admins should become a client services group and rely
on defined processes and policies to get new OIDs into the NMS
and to audit existing monitoring
SNMP Can Become Noise
• Do not poll OIDs (individual OID instances or tables) unless it is data that is
usable, actionable and provides value
• Some OID tables will generate excessive data (QOS Class Map Stats, Load
Balancer Virtual servers, Firewall NAT Tables)
• SNMP can generate noise on the network.
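A back-of-the-envelope estimate helps show how quickly table polling adds up. The sketch below is simple arithmetic with made-up numbers, not a measurement of any particular poller; it ignores bulk requests and retries.

```python
def varbinds_per_second(devices, instances_per_device, poll_interval_s):
    """Average number of OID instances requested per second across the estate."""
    return devices * instances_per_device / poll_interval_s


# Hypothetical estate: 300 devices, 2,000 polled OID instances each, 5 minute polls.
print(varbinds_per_second(300, 2000, 300))
```

Running the same estimate before and after adding a large table (QOS class maps, load balancer virtual servers) gives a quick sense of whether the extra data is worth the load.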
Adding onto SNMP
SNMP will only provide about 70% of the metrics that point to root cause. Valuable evidence will be
missed with SNMP only monitoring
Netflow information will add traffic flow metrics
•Port & Protocol Stats
•Utilization based on client, server, ports, protocols and interfaces.
•Helps to show deviation from BAU for specific servers
Application Monitoring (probe based) TCP analysis
•Analyzes application based on TCP monitoring
Synthetic testing is a repetitive test that constantly validates the following (e.g. Cisco IP SLA):
•Availability, latency, data transfer time, etc.
•Failures – connection failures, session refused, etc.
Customer experience monitoring (probe, robot, etc.) provides an exact count of what customers experience:
•Failures
•Transactions
•Transaction Times
TRAPS / SYSLOGS
• Mission critical devices that could change from performance level events
to fault level events in less than a standard performance poll period need
to generate FAULT level alerts via SNMP traps and syslogs so that they get
proper attention before the business is impacted.
– Firewalls
– Load Balancers
– DNS
NETFLOW
NetFlow is a network protocol developed by Cisco Systems for collecting IP traffic information. NetFlow has become an
industry standard for traffic monitoring and is supported on various platforms. [1]
• Analyze new applications and their network impact
Identify new application network loads such as VoIP or remote site additions.
• Reduction in peak WAN traffic
Use NetFlow statistics to measure WAN traffic improvement from application-policy changes; understand who is utilizing the network and the network top talkers.
• Troubleshooting and understanding network pain points
Diagnose slow network performance, bandwidth hogs and bandwidth utilization quickly.
• Detection of unauthorized WAN traffic
Avoid costly upgrades by identifying the applications causing congestion.
• Security and anomaly detection (this may not be applicable, depending on the tools used; some tools only keep the top 200 talkers per port and protocol)
NetFlow can be used for anomaly detection and worm diagnosis along with applications such as Cisco CS-Mars.
• Validation of QoS parameters
Confirm that appropriate bandwidth has been allocated to each Class of Service (CoS) and that no CoS is over- or under-subscribed.
Reference
1 - http://en.wikipedia.org/wiki/Netflow
2 - http://www.cisco.com/c/en/us/products/collateral/ios-nx-os-software/ios-netflow/prod_white_paper0900aecd80406232.html
Netflow – Filling in the Blanks
[Graph labels: SNMP Data; IMPACT (felt as latency)]
QOS does help with prioritizing the
traffic and preventing drops due to
congestion. Latency is felt more heavily on
non-mission critical QOS queues.
Netflow paints inside the lines and
provides information about who
or what caused the issue.
Filling in the Gaps
SNMP will provide performance metrics
inside and on the edges of the network
devices
Netflow will provide network stats for the
traffic going across IP addressed interfaces
or VLANs. Do not over-design Netflow by
enabling it on both sides of the same
link unless necessary.
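To make the "who or what caused the issue" point concrete, the sketch below totals bytes per source address from already-collected flow records and returns the top talkers. The tuple layout is a hypothetical, simplified record, not the export format of any particular collector.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Rank source addresses by total bytes.

    flows: iterable of (src_ip, dst_ip, dst_port, protocol, byte_count) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    totals = Counter()
    for src_ip, _dst_ip, _dst_port, _protocol, byte_count in flows:
        totals[src_ip] += byte_count
    return totals.most_common(n)


flows = [
    ("10.0.0.5", "10.0.1.1", 443, "tcp", 5000),
    ("10.0.0.9", "10.0.1.1", 443, "tcp", 1200),
    ("10.0.0.5", "10.0.1.2", 53, "udp", 300),
]
print(top_talkers(flows, n=2))
```

The same grouping can be keyed on destination port or protocol to answer "what" rather than "who".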
Vendor Netflow Support
Reference
1http://en.wikipedia.org/wiki/Netflow
Netflow V5 vs. V9
Version   Comment
v1    First implementation, now obsolete, and restricted to IPv4 (without IP mask and AS Numbers).
v2    Cisco internal version, never released.
v3    Cisco internal version, never released.
v4    Cisco internal version, never released.
v5    Most common version, available (as of 2009) on many routers from different brands, but restricted to IPv4 flows.
v6    No longer supported by Cisco. Encapsulation information (?).
v7    Like version 5 with a source router field. Used (only?) on Cisco Catalyst switches.
v8    Several aggregation forms, but only for information that is already present in version 5 records.
v9    Template based, available (as of 2009) on some recent routers. Mostly used to report flows like IPv6, MPLS, or even plain IPv4 with BGP nexthop.
v10   aka IPFIX, IETF standardized NetFlow 9 with several extensions like Enterprise-defined field types and variable length fields.
Reference
1 - http://en.wikipedia.org/wiki/Netflow
•Some devices only support V9
•Newer devices only support
sampling (e.g. 1 of 50)
•IPv6 is only supported in V9 and
higher.
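Two practical consequences of the version and sampling notes can be sketched in a few lines. A collector can tell v5, v9, and IPFIX export packets apart because each starts with a 16-bit version field, and counts from a 1-in-N sampled exporter have to be scaled up to approximate real traffic. The function names are illustrative, and the scaling is only an estimate.

```python
import struct


def export_version(packet: bytes) -> int:
    """Read the 16-bit, network-byte-order version field at the start of an export packet."""
    (version,) = struct.unpack_from("!H", packet, 0)
    return version


def scale_sampled(count: int, sampling_rate: int) -> int:
    """Estimate the real count from a 1-in-N sampled count (an approximation)."""
    return count * sampling_rate


# A fabricated 20-byte buffer whose first two bytes carry version 9.
fake_packet = b"\x00\x09" + b"\x00" * 18
print(export_version(fake_packet))
print(scale_sampled(120, 50))
```

Scaled figures are good enough for trending and top-talker views, but short flows can be missed entirely at high sampling rates.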
AppFlow (what is it)
• AppFlow is a mechanism within the Citrix load balancers that provides more insight than standard Netflow.
It sends information via IPFIX (Internet Protocol Flow Information eXport), which is an open Internet
Engineering Task Force (IETF) standard defined in RFC 5101.
• AppFlow provides visibility at the transaction level for HTTP, SSL, TCP, and SSL_TCP flows and will
provide HTTP transaction codes.
• AppFlow records contain standard NetFlow or IPFIX information, such as
– Time stamps for the beginning and end of a flow
– Packet count and byte count
– Application-level information
• HTTP URLs, HTTP request methods
• Response-status codes
• Server response time
• Latency
SNMP, Netflow and AppFlow
SNMP will provide performance metrics inside and on the edges of
the network devices
Appflow can provide additional TCP and WEB transaction flow
metrics and client side experience.
Netflow will provide network stats for the traffic going across IP
addressed interfaces or VLANs. Do not over-design Netflow by enabling
it on both sides of the same link unless necessary.
Application Monitoring
Probes should be placed as close to the servers as
possible. A span or a tap at an aggregation point (such
as a load balancer) is a great place to collect all
application related traffic for analysis.
Port mirroring (SPAN ports) can be cost effective but can
be commandeered and modified by Operations during
troubleshooting. Adding additional ports to the span
session could cause the probe to see packets
two or three times.
De-duplication devices should be considered if SPAN
sessions are going to be leveraged. Some de-duplication devices do not have a large enough buffer
window to accommodate proper de-duplication.
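The buffer window issue can be illustrated with a toy de-duplicator: a packet counts as a duplicate only if an identical copy was seen within the window, so a window that is too short lets late copies through. This is a simplified sketch that hashes whole packets and assumes timestamps arrive in non-decreasing order; real devices typically ignore fields that legitimately change between copies.

```python
import hashlib
from collections import OrderedDict


class Deduplicator:
    """Toy packet de-duplicator with a fixed time window (in seconds)."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._seen = OrderedDict()  # packet digest -> last-seen timestamp

    def is_duplicate(self, timestamp: float, packet: bytes) -> bool:
        # Drop entries that have aged out of the window (oldest first).
        while self._seen:
            oldest_digest, oldest_ts = next(iter(self._seen.items()))
            if timestamp - oldest_ts <= self.window_s:
                break
            self._seen.popitem(last=False)
        digest = hashlib.sha256(packet).digest()
        duplicate = digest in self._seen
        # Record (or refresh) this packet as the most recently seen entry.
        self._seen[digest] = timestamp
        self._seen.move_to_end(digest)
        return duplicate


dedup = Deduplicator(window_s=0.005)
print(dedup.is_duplicate(0.000, b"pkt"))  # first copy
print(dedup.is_duplicate(0.002, b"pkt"))  # copy inside the window
print(dedup.is_duplicate(0.050, b"pkt"))  # copy arriving after the window
```

The third call shows the failure mode: the copy arrives after the window has expired, so it is passed to the probe as if it were new traffic.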
Application Monitoring
Server Response Time
The time it takes for a server to send an initial response to a client request
or the initial server "think time." Increases in the Server Response Time
generally indicate the following:
•A lack of server resources such as CPU, memory, disk, or I/O
•A poorly written application
Data Transfer Time
The time it takes to transmit a complete response measured from the
initial to final packet. Data Transfer Time excludes the initial server
response time and includes only Network Round Trip Time if there is
more data to send than fits in the TCP window.
Retransmission Delay
The elapsed time between the original packet send and the last duplicate
packet send.
Network Round Trip Time
The time it takes for a packet to travel across the network in both
directions between the client and server.
Total Transaction Time
The time it takes to complete a TCP transaction or data request within a
persistent TCP connection.
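The timing definitions above reduce to differences between packet timestamps. The sketch below applies them to hypothetical capture times in milliseconds; it is a simplification for illustration and does not reproduce how any specific monitoring product computes these values.

```python
def server_response_time(request_ts, response_ts):
    """Time from the client request to the first response packet."""
    return response_ts[0] - request_ts


def data_transfer_time(response_ts):
    """Time from the first to the final response packet."""
    return response_ts[-1] - response_ts[0]


def retransmission_delay(send_ts):
    """Time from the original send to the last duplicate send of one packet."""
    return send_ts[-1] - send_ts[0]


# Hypothetical timestamps in milliseconds relative to the client request.
request_at = 0
responses_at = [120, 150, 210]
resends_at = [10, 210, 610]  # original send plus two retransmissions

print(server_response_time(request_at, responses_at))
print(data_transfer_time(responses_at))
print(retransmission_delay(resends_at))
```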
Reference
CA Application Delivery Analysis Help Document
Synthetic Testing
– IP SLA (Internet Protocol Service Level Agreement) is an
existing feature within certain levels of Cisco Internetwork Operating
System (Cisco IOS) that allows an IT professional to collect information
about network performance in real time.
• 802.1agEcho, 802.1agJitter, dhcp, dns, echo, ftp, http, jitter, pathEcho, pathJitter, tcpConnect,
udpEcho
– Bots (internal and external) mimicking real users provide a great way to
collect real time performance statistics on an issue.
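Where a device-based test is not available, a simple script can act as a synthetic test. The sketch below measures TCP connect time using only the Python standard library; it is not IP SLA and not a substitute for it, and the host and port in the usage note are placeholders.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the connection fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return None
```

Calling `tcp_connect_time("app.example.com", 443)` on a schedule and recording both the timings and the `None` results gives availability and latency samples that can be thresholded like any other polled metric.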
Customer Experience
• Customer experience is paramount and can make or break a service or
offering.
• Some vendors have started to provide either probe based or agent based
monitoring tools that can isolate and identify the true customer experience.
– Transaction time
– Defect Monitoring
– Transaction Error tracking
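Transaction error tracking often starts with counting HTTP status classes. The sketch below summarizes a list of response codes into client errors (4xx), server errors (5xx), and an overall error rate; where the codes come from (probe, agent, or logs) is left open, and the result keys are illustrative.

```python
from collections import Counter


def error_summary(status_codes):
    """Summarize HTTP status codes into client/server error counts and an error rate."""
    classes = Counter(code // 100 for code in status_codes)
    total = len(status_codes)
    errors = classes[4] + classes[5]
    return {
        "total": total,
        "client_errors": classes[4],
        "server_errors": classes[5],
        "error_rate": errors / total if total else 0.0,
    }


print(error_summary([200, 200, 404, 500]))
```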
The Perfect Version of Performance/Fault Management
[Diagram: the Infrastructure, Application, Network, Database, and Mainframe silos, each with the metrics listed under "Some Silos That Exist" and each labeled with Perf and Fault feeds from SNMP, Netflow, Testing, Syslogs, Traps, and Events. Other labels: Manager of Manager, Application Correlation, Root Cause Identification, Business Impact, Service Desk]
Change Related Outage Scenario
Issue = a DB platform update caused a reduction in the number
of concurrent sockets available.
SNMP – may report the following:
1. 1st Tier Load Balancer client session count goes up for specific virtual server sessions (VIPs). Concurrent Web server connection count also goes up.
2. Utilization graphs may show a leveling off of outbound traffic from the DB tier compared to past
performance.
3. If the application tier is also load balanced, the concurrent client connection and server connection
counts will also go up for this application flow.
Reason – client connections are taking much longer to satisfy requests.
Netflow / AppFlow
1. May not show any deviation from the norm unless the application flow is compared between
two time periods. Current flows may show a leveling off at a much lower level.
2. Load balancer AppFlow may report session timeouts or sessions refused.
Application Monitoring
1. Comparing application flow performance from tier to tier will pinpoint that the server
response time is high on the first and middle tiers and is being caused by the DB
environment.
If thresholds were set, and the tools are able to alarm based on a
deviation from a known baseline, this issue could be alarmed on before
critical business impact is felt and the issue could be quickly resolved.
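Alarming on deviation from a known baseline can be as simple as comparing the current value with the mean and standard deviation of past samples. The sketch below uses that approach as one possible definition of "deviation"; the multiplier `k` and the sample values are assumptions, and commercial tools generally use more elaborate, time-of-day aware baselines.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline, current, k=3.0):
    """True if current is more than k standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # needs at least two baseline samples
    return abs(current - mu) > k * sigma


# Hypothetical concurrent connection counts for one VIP at the same hour on past days.
baseline = [100, 104, 98, 102, 96]
print(deviates_from_baseline(baseline, 105))
print(deviates_from_baseline(baseline, 180))
```

In the scenario above, the rising load balancer connection counts would trip this kind of check well before sessions start being refused.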
Custom Internal Dashboards
[Example graphs: Netflow and SNMP. These graphs are not internally developed.]
For basic dashboards that combine two or more data
sources, make sure you relationally tie the data
together on fields that are never null
(e.g. link dashboard data elements based on device
IP and interface index).
Metadata enrichment of dashboards can usually only
be done through custom scripting.
Performance event correlation is very difficult and
requires an understanding of application and system
dependencies. Leverage off-the-shelf tools for this
level of analysis, but make sure that the tools you buy
have an API that gives you access to the
data for custom development.
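Tying two data sources together on device IP and interface index can be sketched as a keyed join. The row layout and the field names `device_ip` and `if_index` are hypothetical; rows with a missing key field are skipped, since a null key cannot be matched reliably.

```python
def join_on_interface(snmp_rows, netflow_rows):
    """Attach Netflow rows to SNMP rows that share (device_ip, if_index)."""
    flows_by_key = {}
    for row in netflow_rows:
        key = (row.get("device_ip"), row.get("if_index"))
        if None in key:
            continue  # never join on a null key
        flows_by_key.setdefault(key, []).append(row)

    joined = []
    for row in snmp_rows:
        key = (row.get("device_ip"), row.get("if_index"))
        if None in key:
            continue
        joined.append({**row, "flows": flows_by_key.get(key, [])})
    return joined


snmp_rows = [
    {"device_ip": "10.1.1.1", "if_index": 3, "util_pct": 82.0},
    {"device_ip": "10.1.1.1", "if_index": None, "util_pct": 5.0},
]
netflow_rows = [
    {"device_ip": "10.1.1.1", "if_index": 3, "src_ip": "10.0.0.5", "bytes": 5300},
    {"device_ip": "10.1.1.2", "if_index": 3, "src_ip": "10.0.0.9", "bytes": 1200},
]
print(join_on_interface(snmp_rows, netflow_rows))
```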
Other Reasons for Custom Scripting
• A single golden source should be the common database that feeds all tools
(SNMP polling, Netflow collection, etc.).
• Tools may not all share the same discovery mechanism and may
need their seed-discovery information in different formats, so you will most
likely need some custom scripting to help tie things
together.
• If you are creating your own dashboards, link data based on device IP and
interface index.
Custom Scripting
• Even though all vendors promise a “single pane of glass”, there is really no one vendor that can provide all the bells
and whistles and best-of-class elements that exist in all the others. Some vendors come close and provide a
method to collapse multiple data collection methods (SNMP, Netflow, etc.) into a common dashboard.
Q&A
Thanks for the opportunity to present to you