Monitoring Blackboard - Presenter: Volker Kleinschmidt

Blackboard Client Support
© Blackboard, Inc. All rights reserved.
Session Abstract
» Monitoring Blackboard
» As your Blackboard system becomes more and
more mission-critical, the need to monitor its
availability and performance increases.
Collecting such data over time allows easier
troubleshooting and problem isolation. This
session will present several approaches for
monitoring whether and how well key areas of
Blackboard are performing.
About Forward-Looking Statements
» We may make statements regarding our product
development and service offering initiatives,
including the content of future product
upgrades, updates or functionality in
development. While such statements represent
our current intentions, they may be modified,
delayed or abandoned without prior notice, and
there is no assurance that such offerings,
upgrades, updates or functionality will become
available unless and until they have been made
generally available to our customers.
Quote
"It isn't a service if it isn't
monitored. If there is no
monitoring then you're just
running software."
-- Tom Limoncelli
Types of Monitoring
» Liveness Monitoring
» Status Monitoring
» Time Series Monitoring
» Performance Monitoring
» Predictive Analysis
Liveness Monitoring
Is the server or service up?
Most basic and simple form of monitoring
» ping to network interface
» database: tnsping (best done from appserver)
» webserver: GET /nodatabase.html
» tomcat: GET /webapps/login
» modperl/PerlEx: GET /bin/button_gallery.pl
» collab: telnet to ports 8010, 8011, 8443 (if SSL)
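The HTTP checks above can be scripted with curl. The following is a minimal sketch, not the presenter's script: the function names http_verdict and probe_url and the host bb.example.edu are made up for illustration, and it assumes that any 2xx or 3xx response means the service is up.

```shell
#!/bin/sh
# Map an HTTP status code to a liveness verdict.
# Assumption: 2xx and 3xx mean UP; anything else, including the
# "000" that curl reports when it cannot connect, means DOWN.
http_verdict() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) echo UP ;;
    *) echo DOWN ;;
  esac
}

# Probe one URL with a 10 second timeout and print "UP <url>" or "DOWN <url>".
probe_url() {
  pu_code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1" 2>/dev/null) || true
  echo "$(http_verdict "$pu_code") $1"
}

# Example (hypothetical host):
#   probe_url http://bb.example.edu/nodatabase.html
#   probe_url http://bb.example.edu/webapps/login
```

Checking the static page and the Tomcat page separately helps tell a dead webserver apart from a dead application server.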
[Screenshot: Nagios status page for the BB Client Support Test Lab]
Status Monitoring
Is the system functioning normally?
Current values of common performance parameters
» system load
» CPU
» memory free/used/swap
» disk I/O
» network I/O
» disk free space
Send alerts when thresholds are exceeded
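A threshold check of this kind can be very small. The sketch below is an illustration only: check_threshold is a hypothetical helper, the threshold numbers in the usage lines are arbitrary, and the exit codes follow the convention used by Nagios plugins (0 = OK, 1 = WARNING, 2 = CRITICAL).

```shell
#!/bin/sh
# check_threshold NAME VALUE WARN CRIT
# Prints a one-line status and returns 0 (OK), 1 (WARNING) or 2 (CRITICAL).
# awk does the comparison so that fractional values such as 1.5 work.
check_threshold() {
  awk -v n="$1" -v v="$2" -v w="$3" -v c="$4" 'BEGIN {
    if (v + 0 >= c + 0)      { print "CRITICAL " n "=" v; exit 2 }
    else if (v + 0 >= w + 0) { print "WARNING " n "=" v;  exit 1 }
    print "OK " n "=" v
  }'
}

# Examples (thresholds are arbitrary):
#   check_threshold load 1.5 4 8      prints "OK load=1.5"
#   check_threshold disk_pct 92 80 90 prints "CRITICAL disk_pct=92"
```

The value itself would come from whatever tool reports it on your platform (uptime, vmstat, df and so on).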
[Screenshot: UIC (University of Illinois at Chicago) internal system monitor]
Status Monitoring
Is the application functioning normally?
Can users log in successfully?
» POST to /webapps/login/ (use known credentials)
Do typical tasks take reasonable time?
» execute timed test script via cmdline browser
» load portal for a known user
» visit and browse through a known course
» visit typical applications: forums, quiz, gradebook
» setup can take significant work – share scripts and
test course in user community to distribute effort
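A timed test can be wrapped around any command-line client. This is a rough sketch under stated assumptions: run_timed is a hypothetical helper, timing has whole-second resolution only, and the form field names and the MON_USER / MON_PASS variables in the usage comment are placeholders rather than confirmed Blackboard parameters.

```shell
#!/bin/sh
# run_timed LIMIT_SECONDS COMMAND [ARGS...]
# Runs the command and prints "OK <n>s", "SLOW <n>s" (took longer than
# the limit) or "FAIL <n>s" (command returned a non-zero exit status).
run_timed() {
  rt_limit=$1; shift
  rt_start=$(date +%s)
  if "$@" >/dev/null 2>&1; then rt_status=0; else rt_status=$?; fi
  rt_elapsed=$(( $(date +%s) - rt_start ))
  if [ "$rt_status" -ne 0 ]; then
    echo "FAIL ${rt_elapsed}s"
  elif [ "$rt_elapsed" -gt "$rt_limit" ]; then
    echo "SLOW ${rt_elapsed}s"
  else
    echo "OK ${rt_elapsed}s"
  fi
}

# Example (hypothetical host, field names and credentials variables):
#   run_timed 10 curl -sf -o /dev/null \
#     -d "user_id=$MON_USER" -d "password=$MON_PASS" \
#     http://bb.example.edu/webapps/login/
```

A dedicated test account keeps the known credentials out of real users' hands and makes the test traffic easy to spot in the logs.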
Time Series Monitoring
Trend: how well is server handling its load?
» Domain of graphing tools such as RRDtool
» measure various usage and load factors
» can measure resulting performance
» determine system usage + server health at once
» provides basis for administrative usage reports
» allows after-the-fact analysis of problems
» should be combined with status monitoring
efforts for threshold-based warnings
[Screenshots: UAA (Alaska Anchorage) BB Dashboard; Clemson’s Ganglia Dashboard]
Status vs. Performance
» Status monitoring has a qualitative focus
» Can certain operations be performed within
reasonable time, below warning threshold?
» If not, trigger alert in monitor tool
» Performance monitoring has a quantitative focus
» Just how long did operation X take?
» Based on time series monitoring
» Rarely necessary for ongoing operations
Predictive Analysis
Will the server still handle its load next term?
» Requires availability of historic data
» Must use grapher tool + keep snapshots
» Identify and plan for worst-case scenarios
» Historic load patterns allow predicting
future demand – but factor in changes in
policy, adoption, usage patterns
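As a deliberately naive illustration of extrapolating from historic data, the sketch below fits a straight line (ordinary least squares) through past peak values and projects it forward. Everything here is an assumption for illustration: the helper name project_next, the input format (one "term-number peak-value" pair per line), and above all the idea that growth is linear, which is exactly what changes in policy, adoption and usage patterns can invalidate.

```shell
#!/bin/sh
# project_next TARGET_X
# Reads "x y" pairs on stdin, fits y = intercept + slope * x by least
# squares and prints the projected y for TARGET_X with one decimal.
project_next() {
  awk -v target="$1" '
    NF >= 2 { n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2 }
    END {
      d = n * sxx - sx * sx
      if (n < 2 || d == 0) { print "need at least two distinct points"; exit 1 }
      slope = (n * sxy - sx * sy) / d
      icpt  = (sy - slope * sx) / n
      printf "%.1f\n", icpt + slope * target
    }'
}

# Example: peak concurrent sessions in terms 1-3, projected for term 4
#   printf '1 100\n2 150\n3 200\n' | project_next 4    prints 250.0
```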
Monitoring Tools
Monitoring Host
Grapher
Data provider/gatherer
Host applications
» Wealth of commercial and free offerings
» SNMP is king for liveness/status monitors
» But reportable data is quite restricted
» Graphers rely on any type of numeric data,
provided on time interval basis by an SNMP
agent or by a data gathering script in a file
» Many/most graphers based on free RRDtool
Monitoring Platforms
Popular liveness & status monitoring hosts:
Free, open source:
» Big Brother (& Big Sister)
» mon
» Nagios
Commercial:
» HP OpenView
» SysOrb
» WhatsUp
Graphing tools
» Based on RRDtool (Tobias Oetiker, ETH)
» Round Robin Database stores time-series data,
e.g. current CPU load average, measured
every 5 minutes – this is a “data source”
» MRTG, cricket, orca, cacti, Bronc, Munin
» Zabbix, Airwave, Big Sister, Torrus, NISCA
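To make the "data source" idea concrete, here is a small sketch of creating and feeding a round robin database directly with the rrdtool command line. The wrapper functions, the file name and the retention choices (one day of 5-minute samples, 30 days of hourly averages) are illustrative assumptions; the front ends listed above normally generate such calls for you.

```shell
#!/bin/sh
RRDTOOL=${RRDTOOL:-rrdtool}   # set RRDTOOL=echo for a dry run

# Create an RRD with one GAUGE data source named "load", sampled every
# 300 s. A sample is considered unknown if none arrives within 600 s.
rrd_create_load() {
  "$RRDTOOL" create "$1" --step 300 \
    DS:load:GAUGE:600:0:U \
    RRA:AVERAGE:0.5:1:288 \
    RRA:AVERAGE:0.5:12:720
}

# Store one sample, timestamped "now" (N).
rrd_update_load() {
  "$RRDTOOL" update "$1" "N:$2"
}

# Example:
#   rrd_create_load load.rrd
#   rrd_update_load load.rrd 0.42
```

The first RRA keeps 288 raw 5-minute values (24 hours); the second consolidates 12 samples into one hourly average and keeps 720 of them.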
Similarities...
» RRDtool under the hood
[Screenshots: Orca; Munin]
The Differences
» Setup and Configuration - hard or terrible?
» Web Interface quality, navigability
» Configurability (e.g. time intervals)
» SNMP support built-in?
» What pollers / data gatherers provided?
» Ease of writing plugins / gatherers
Data Gathering
service parameters
database parameters
system usage parameters
Getting the Data to Report
» SNMP defines a set of MIBs – agents are
pre-compiled to report these
» mon etc. come with their own set of things
to monitor (e.g. vmstat output)
» everything else needs to be gathered by
scripts you write
» nobody said this was easy
The Concept of Data Sources
» Each data source is a numeric entity with
its own range of possible values, units,
critical values, name, description, color...
» Some data sources come from SNMP
agents, others can be put into NFS-mounted
files by remote jobs
» Object tree: machine > service > data
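One possible way for a remote job to publish a value into such a file tree is sketched below. The directory layout (root/machine/service/name) simply mirrors the object tree above, and write_datasource, the mount point and the sample names are hypothetical; your grapher will dictate the real file format.

```shell
#!/bin/sh
# write_datasource ROOT MACHINE SERVICE NAME VALUE
# Writes VALUE to ROOT/MACHINE/SERVICE/NAME. The value goes to a
# temporary file first and is then renamed into place, so a reader
# does not see a half-written file.
write_datasource() {
  wd_dir="$1/$2/$3"
  mkdir -p "$wd_dir" || return 1
  printf '%s\n' "$5" > "$wd_dir/$4.tmp.$$" &&
    mv "$wd_dir/$4.tmp.$$" "$wd_dir/$4"
}

# Example (hypothetical mount point and names):
#   write_datasource /mnt/monitor app01 tomcat threads 57
```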
Data Collector
» Single cron job scheduled every 5min
» Fires off various data collection tasks, e.g.
vmstat, apache status, df
» Simple numbers (e.g. apache total hits)
require post-processing to be useful
» e.g. keep two files (.last, .prev), calculate
and report difference to get interval value
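The .last/.prev technique can be sketched as follows. This is one possible implementation, not the presenter's script: interval_delta is a hypothetical helper, it handles whole numbers only, and treating a shrinking counter as a restart (reporting the new value as the delta) is an assumption.

```shell
#!/bin/sh
# interval_delta BASENAME NEW_VALUE
# Rotates BASENAME.last to BASENAME.prev, stores NEW_VALUE in
# BASENAME.last and prints the increase since the previous run.
interval_delta() {
  id_base=$1; id_new=$2
  if [ -f "$id_base.last" ]; then mv "$id_base.last" "$id_base.prev"; fi
  echo "$id_new" > "$id_base.last"
  # First run: nothing to compare against yet.
  if [ ! -f "$id_base.prev" ]; then echo 0; return 0; fi
  id_prev=$(cat "$id_base.prev")
  id_diff=$((id_new - id_prev))
  # Counter went backwards, e.g. after a server restart.
  if [ "$id_diff" -lt 0 ]; then id_diff=$id_new; fi
  echo "$id_diff"
}

# Example, run from the 5-minute cron job with the cumulative hit count:
#   interval_delta /var/tmp/apache_hits "$TOTAL_HITS"
```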
Sample Service Parameters
» HTTP requests (hits) – cumulative/current
» bytes transferred
» number of threads/processes
» number of active processes
» apache reports all this and more via
server-status?auto (KB 181-2560)
» IIS has Perfmon counters for these
Example: tomcat thread count
» BBDIR=/usr/local/blackboard
» TCPID=`cat $BBDIR/logs/pid-files/tomcat.pid`
» Solaris: ps -Lp $TCPID | wc -l | sed 's/ //g'
» Linux: ps -eo pid,ppid | grep $TCPID | ...
» or: pstree -p $TCPID | ...
» RHEL3: ps -emo pid,ppid | grep $TCPID | ...
» Windows: pv java* -l"*tmpdir*" -o "%t"
(find TCPID via pv also: pv java* -l"*tmpdir*" -o"%i")
» pv = freely downloadable cmdline Process Viewer
(use pv -h to find out about invocation details)
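On Linux kernels that expose per-process thread counts, the number can also be read straight from /proc. This is a different technique from the ps pipelines above, shown only as a sketch; thread_count is a hypothetical helper and it assumes /proc/PID/status contains a "Threads:" line.

```shell
#!/bin/sh
# thread_count PID
# Prints the number of threads of the given process (Linux only).
thread_count() {
  awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Example, reusing the pid file location from the slide:
#   BBDIR=/usr/local/blackboard
#   thread_count "$(cat $BBDIR/logs/pid-files/tomcat.pid)"
```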
Multiple data sources, one file
curl 'http://localhost/server-status?auto' 2>/dev/null | \
head -9 | cut -d: -f2 | cut -b2- >apachestats
Contents of apachestats:
15
20
.00847458
354
.0423729
57.8531
1365.33
1
9
You need to configure your graphs with legends, correct intervals,
etc. to know what each of these values means
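Picking values by position breaks if the order of fields changes, so an alternative is to look a value up by its label. The sketch below assumes the "Label: value" layout that Apache's mod_status prints for ?auto; status_field is a hypothetical helper. In the sample above, the nine values appear to correspond, in order, to Total Accesses, Total kBytes, CPULoad, Uptime, ReqPerSec, BytesPerSec, BytesPerReq, BusyWorkers and IdleWorkers (the arithmetic is consistent: 15 requests in 354 seconds is .0423729 per second), but labels can differ between Apache versions.

```shell
#!/bin/sh
# status_field LABEL
# Reads "Label: value" lines on stdin and prints the value for LABEL.
status_field() {
  awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}

# Example:
#   curl -s 'http://localhost/server-status?auto' | status_field BusyWorkers
```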
Sample database parameters
» free space per tablespace
» number of current DB sessions
» number of executions of top V$SQL items
» too many reportable things to list
» some of these parameters change rarely,
so don’t query them that often
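One way to avoid re-running a slow query on every 5-minute cycle is to cache its result and refresh it only after a maximum age. The following is a sketch under assumptions: cached_value and the query script named in the usage comment are hypothetical, and the cache file format (epoch timestamp on line 1, value after it) is invented for this example.

```shell
#!/bin/sh
# cached_value CACHE_FILE MAX_AGE_SECONDS COMMAND [ARGS...]
# Prints the cached output of COMMAND if it is younger than
# MAX_AGE_SECONDS; otherwise runs COMMAND and refreshes the cache.
cached_value() {
  cv_file=$1; cv_max=$2; shift 2
  cv_now=$(date +%s)
  if [ -f "$cv_file" ]; then
    cv_stamp=$(sed -n 1p "$cv_file")
    if [ $((cv_now - cv_stamp)) -lt "$cv_max" ]; then
      sed -n '2,$p' "$cv_file"
      return 0
    fi
  fi
  cv_value=$("$@") || return 1
  printf '%s\n%s\n' "$cv_now" "$cv_value" > "$cv_file"
  printf '%s\n' "$cv_value"
}

# Example: refresh tablespace figures at most once a day
# (tablespace_free.sh is a hypothetical script wrapping your SQL client):
#   cached_value /var/tmp/tbs.cache 86400 ./tablespace_free.sh
```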
Sample system usage parameters
» # current authenticated sessions (from DB)
(Seneca WhosOnline auto-script)
» # logins in last 5 mins (from webserver log)
» # current chat participants & courses
» # currently active quiz attempts (not!)
» number of courses with >10 forum posts
(expensive queries can be run once daily)
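Counting logins from the webserver log is cheap if each run only scans the lines added since the previous run. The sketch below is an assumption-laden illustration: count_new_logins is a hypothetical helper, it counts every POST to /webapps/login (so failed attempts are included), and it treats a log that has become shorter as rotated.

```shell
#!/bin/sh
# count_new_logins LOGFILE STATEFILE [PATTERN]
# Prints how many lines added to LOGFILE since the last call contain
# PATTERN (fixed string), and remembers the line count in STATEFILE.
count_new_logins() {
  cnl_log=$1; cnl_state=$2
  cnl_pattern=${3:-POST /webapps/login}
  cnl_seen=0
  if [ -f "$cnl_state" ]; then cnl_seen=$(cat "$cnl_state"); fi
  cnl_total=$(wc -l < "$cnl_log" | tr -d ' ')
  # Log is shorter than last time: assume it was rotated.
  if [ "$cnl_total" -lt "$cnl_seen" ]; then cnl_seen=0; fi
  awk -v s="$cnl_seen" -v t="$cnl_total" -v p="$cnl_pattern" \
    'NR > s && NR <= t && index($0, p) { n++ } END { print n + 0 }' "$cnl_log"
  echo "$cnl_total" > "$cnl_state"
}

# Example, run from the 5-minute cron job (paths are hypothetical):
#   count_new_logins /var/log/httpd/access_log /var/tmp/logins.state
```

Run from the same 5-minute schedule as the other collectors, the result approximates "logins in the last 5 minutes" without parsing timestamps.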
A word of warning
» Beware the Heisenberg principle
» Avoid resource-intensive measurements
(don’t count hits in activity_accumulator)
» Avoid over-monitoring – reports that
nobody reads are a great waste
» Fully document/label your reports, or they
are useless to anyone but you
Community contributions?
» Ideally we could build a collection of
user-contributed monitoring scripts and tools
» Listserv is too ephemeral
» Blackboard Community site?
» Lots of work to be done!
» Interest in a possible consulting offering?
Further Info
» The mother of monitoring links:
http://slac.stanford.edu/xorg/nmtf/nmtf-tools.html
» John Sellens’ Monitoring page:
http://www.generalconcepts.com/resources/monitoring/