NGOP Overview
Jim Fromm
Farms and Clustered Systems Group
Computing Division
Fermilab
People

Integrated Systems Development Department: Don Petravick, Krzysztof Genser, Jim Fromm, Tanya Levshina, Igor Mandrichenko
Operating Systems Support Dept.: Terry Jones, Troy Dawson, Lisa Giachetti, Ken Schumacher, Marc Mengel
Computing Services Dept.: Jeff Mack, Rick Thies, Rich Thompson
November 2, 2000
http://www-isd.fnal.gov/ngop
Goals

The NGOP working group is charged with developing a Distributed Management System (DMS) that scales to the anticipated requirements of the Run II farms.
The future size of the farms requires the DMS to be proactive: the system should take corrective action when possible.
It must detect hardware, system, and application problems.
Problem diagnostics should eliminate “noise”, or false alarms.
It should provide tools for performance analysis.
NGOP History

Summer 1999: NGOP group created to gather requirements for a Distributed Management System capable of efficiently monitoring the Fermilab computing facility for Run II.
Sept 1999: Requirements gathering completed.
Dec 1999: Evaluation of available products presented.
Jan 2000: Decision made to develop a custom DMS.
Today: Development of the prototype is underway. Completion is expected before year end.
We are not alone…

As computer farms get larger, other HEP sites are facing a similar problem.
March 2000: CERN and BNL visited Fermilab to exchange ideas and lessons learned. SLAC, JLAB, and IN2P3 participated via video conference.
July 2000: Fermilab visited CERN to follow up on the March meetings.
Some Terminology

Monitored Object: one of the following:
  Host: A computer identified by its fully qualified domain name.
  Cluster: A collection of hosts.
  Component: An atomic element that has a well-defined behavior.
  System: A collection of components.
Condition: A pre-defined state of a Monitored Object.
Event: A description of a detected condition.
Action: An activity initiated by the NGOP system based on an event.
Alarm: An asynchronous indicator initiated by NGOP.
Status: The level of “functionality” of a monitored element.
Monitoring Agent: A software component that generates events based on conditions and performs actions.
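The terms above could be modeled roughly as follows. This is only an illustrative sketch: the class names, field names, and example host names are assumptions and are not taken from the NGOP code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    """A computer identified by its fully qualified domain name."""
    fqdn: str


@dataclass
class Cluster:
    """A collection of hosts."""
    name: str
    hosts: List[Host] = field(default_factory=list)


@dataclass
class Event:
    """A description of a detected condition on a monitored object."""
    object_name: str  # which monitored object the event refers to
    condition: str    # name of the pre-defined state that was detected
    severity: str     # hypothetical field, for example "Warning"


# Hypothetical example data.
farm = Cluster("farm-a", [Host("node1.example.org"), Host("node2.example.org")])
event = Event(object_name=farm.hosts[0].fqdn, condition="tmp_full", severity="Warning")
```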
NGOP Requirements

Essential Features:
  Should detect hardware, network, system, and application errors, for example:
    System daemon status (inetd, mbatchd)
    Unreachable hosts
    Security breaches
    /tmp full
  Should run on all Fermilab-supported operating systems.
  Scalable to thousands of hosts.
  Must be multi-user and support different authorization levels.
  Provide an interface for user-written monitoring tools.
  Generate different levels of alarms (Warning, Info, etc.).
  Perform actions based on alarms and events (email, page, restart daemon).
  Provide a hierarchical view of the monitored system.
  Dynamic configuration.
  Provide monitoring capabilities via a web browser, GUI, and command-line interface.
  Provide special states for monitored objects, such as “known bad”.
Desirable Features:
  Ability to have overlapping clusters.
  Ability to generate reports based on selection criteria.
  Implement step-by-step notification of performed actions.
Products Evaluation

Some Evaluated Products:

Patrol






Tkined/Scotty






No notion of hierarchy or clusters.
Web and “GUI” (curses) interfaces have limited customization.
Very limited filtering of events
Netlogger





Not scalable for multiple users
System monitored only while GUI running
Only one level of alarms
Nocol


Not scalable for centralized monitoring
One level of hierarchy
No overlapping clusters
No filtering of events
No GUI/UI
Limited off-shelf functionality
No customization for monitoring agents
Very limited way to create hierarchy.
Requires too much knowledge of underlying system to detect a problem.
Misc Commercial Products
Complex
Did not meet requirements
Very expensive, both in terms of licensing and setup costs.
Product Evaluation Summary

Many commercial and open-source products try to solve the problem in many different ways.
None of the evaluated products met the basic requirements at Fermilab.
Discussions with others who chose the commercial route were not encouraging; many bad experiences were documented.
The decision was made to develop our own custom DMS.
Design Summary – Key System Components

Monitoring Agent (MA): Monitors a monitored object and generates events based on certain conditions.
Sensor Agent: Similar to a monitoring agent, but this process collects performance data and generates events at a higher rate than a monitoring agent.
NGOP Central Server (NCS): The central daemon process that gathers events from MAs, provides users with requested information, and dumps persistent data into the Archive Server.
NGOP Configuration File Management Service: Provides a mechanism to centrally locate system configuration and rules. Allows for dynamic reconfiguration of the system.
Archive Server: A daemon that handles archive storage. Provides a means to write, read, and query the data.
Monitoring Client: Communicates with the NCS using an API to display system status in a meaningful manner.
NGOP Architecture

[Figure: NGOP architecture diagram. Labeled elements: Clusters A, B, B1, and B2 containing Monitoring Agents (MA) and Sensor Agents (s); Central Server; Configuration File Management Service with persistent configuration data; Archive Service and archive; Performance Storage Service and performance data; Report Generator; Data Analyzer; Router; Monitor, Administrator, and Action clients. Legend: monitored objects (host, element, cluster, system); NGOP components (sensor agent, monitoring agent, server, monitoring clients, data storage); connections (TCP, UDP, connection between monitored element and MA); and a marker for parts not yet implemented in the prototype.]
Monitoring Agents – The hook into NGOP

A monitoring agent (MA) is a process that monitors an object and generates an event when a condition is met. A message describing this event is sent to the NGOP Central Server (NCS).
NGOP defines the protocol used to exchange information with the central server.
A set of basic MAs will be deployed with the NGOP system; users are free to write their own.
An API (C, C++, Perl, Python) will be provided to allow for development of MAs.
MAs should send information to the NCS:
  When the current characteristics of a monitored object meet a condition.
  When the condition is no longer satisfied.
  Periodically, as heartbeat messages that let the NCS know the MA is still alive.
Examples:
  Monitor whether or not a batch system is running.
  Monitor the size of a file system, issuing alarms when it is 90% full.
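The file system example can be sketched as a small agent that reports when the condition becomes true, reports again when it clears, and sends periodic heartbeats. This is a minimal sketch, not the NGOP API: the class name, the message fields, the `send` callback standing in for the transport to the NCS, and the 60-second heartbeat interval are all assumptions.

```python
import shutil
import time
from typing import Callable, Optional


class FileSystemAgent:
    """Watches one file system and reports when it crosses a fullness threshold."""

    def __init__(self, path: str, send: Callable[[dict], None],
                 threshold: float = 0.90, heartbeat_interval: float = 60.0,
                 read_fraction: Optional[Callable[[], float]] = None):
        self.path = path
        self.send = send  # hypothetical transport to the NCS
        self.threshold = threshold
        self.heartbeat_interval = heartbeat_interval
        self.read_fraction = read_fraction or self._disk_fraction
        self.condition_active = False
        self.last_heartbeat: Optional[float] = None

    def _disk_fraction(self) -> float:
        """Fraction of the file system currently in use."""
        usage = shutil.disk_usage(self.path)
        return usage.used / usage.total

    def poll(self, now: Optional[float] = None) -> None:
        """Run one monitoring cycle; intended to be called periodically."""
        now = time.time() if now is None else now
        full = self.read_fraction() >= self.threshold
        if full and not self.condition_active:
            # The condition has just been met.
            self.send({"type": "event", "object": self.path,
                       "condition": "fs_full", "state": "set"})
        elif not full and self.condition_active:
            # The condition is no longer satisfied.
            self.send({"type": "event", "object": self.path,
                       "condition": "fs_full", "state": "cleared"})
        self.condition_active = full
        if (self.last_heartbeat is None
                or now - self.last_heartbeat >= self.heartbeat_interval):
            self.send({"type": "heartbeat", "object": self.path, "time": now})
            self.last_heartbeat = now
```

An agent process would call `poll()` in a loop with a sleep between cycles. The `read_fraction` parameter exists only so that the measurement can be replaced, for example in tests.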
Sensor Agents

Sensor Agents send performance data to the Performance Storage Service.
The rate of this data is expected to be much higher than that of the MAs.
Examples:
  Monitor the temperature of a computer every second.
  Monitor the CPU utilization continuously.
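A sensor agent of this kind amounts to a fixed-rate sampling loop. The sketch below is an assumption-laden illustration: the function names, the sample layout, and the `store` callback standing in for the Performance Storage Service are hypothetical, and the load average is used only as a convenient metric (`os.getloadavg()` is available on Unix only).

```python
import os
import time
from typing import Callable, Tuple

Sample = Tuple[float, str, float]  # (timestamp, metric name, value)


def load_average() -> float:
    """One-minute load average (Unix only)."""
    return os.getloadavg()[0]


def run_sensor(metric: str, read: Callable[[], float],
               store: Callable[[Sample], None], samples: int,
               interval: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Take `samples` readings, `interval` seconds apart, and pass each to `store`."""
    for i in range(samples):
        store((time.time(), metric, read()))
        if i < samples - 1:
            sleep(interval)  # no sleep after the final reading
```

For example, `run_sensor("load1", load_average, print, samples=5)` would print five timestamped readings one second apart.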
NGOP Central Server

The NCS is the process that receives messages sent from MAs, stores them via the Archive Server, and provides monitoring clients (a GUI, for example) with requested information.
One instance of the NCS will be running in the system.
The NCS must handle many (10,000+) MAs and about 50 clients.
The NCS should:
  Update object characteristics when an MA reports a change.
  Determine if an MA is dead, and forward this information to the relevant monitoring client.
  Forward event and action messages to the Archive Server.
  Forward event messages to subscribed monitoring clients.
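The responsibilities listed above can be illustrated with an in-memory sketch. Everything here is hypothetical: the class and method names, the message fields, and the timeout value are assumptions, a plain list stands in for the Archive Server, and callbacks stand in for monitoring clients.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class CentralServer:
    """In-memory stand-in for the NCS: tracks agents, archives and forwards events."""

    def __init__(self, heartbeat_timeout: float = 180.0):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_seen: Dict[str, float] = {}     # agent id -> time of last message
        self.agent_object: Dict[str, str] = {}    # agent id -> monitored object
        self.subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.archive: List[dict] = []             # stand-in for the Archive Server

    def subscribe(self, object_name: str, callback: Callable[[dict], None]) -> None:
        """Register a monitoring client for messages about one monitored object."""
        self.subscribers[object_name].append(callback)

    def receive(self, agent_id: str, message: dict, now: float) -> None:
        """Handle one message from an agent; any message counts as a sign of life."""
        self.last_seen[agent_id] = now
        self.agent_object[agent_id] = message["object"]
        if message["type"] == "event":
            self.archive.append(message)
            self._publish(message)

    def check_agents(self, now: float) -> List[str]:
        """Report agents that have been silent for longer than the timeout."""
        dead = [a for a, seen in self.last_seen.items()
                if now - seen > self.heartbeat_timeout]
        for agent_id in dead:
            del self.last_seen[agent_id]  # report each dead agent only once
            self._publish({"type": "agent_dead", "agent": agent_id,
                           "object": self.agent_object[agent_id]})
        return dead

    def _publish(self, message: dict) -> None:
        for callback in self.subscribers.get(message["object"], []):
            callback(message)
```

A real server handling 10,000+ agents would need network I/O and persistent storage; the sketch only shows the bookkeeping.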
NGOP Configuration File Management Service

Responsible for providing a central repository for system configuration and monitoring rules.
Allows for dynamic reconfiguration of the system.
Configuration files are written in XML.
The central repository is implemented using CVS in the prototype.
Only authorized users can update it.
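As an illustration of XML-based configuration, the sketch below parses a small configuration document with Python's standard `xml.etree.ElementTree` module. The element names, attribute names, and host names are invented for this example; the actual NGOP schema is not described in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; the real NGOP schema may look quite different.
CONFIG = """
<ngop>
  <cluster name="farm-a">
    <host name="node1.example.org"/>
    <host name="node2.example.org"/>
  </cluster>
  <rule object="node1.example.org" condition="fs_full" status="Bad" alarm="Warning"/>
</ngop>
"""


def load_config(text: str):
    """Return (clusters, rules) parsed from a configuration document."""
    root = ET.fromstring(text.strip())
    clusters = {
        cluster.get("name"): [host.get("name") for host in cluster.findall("host")]
        for cluster in root.findall("cluster")
    }
    rules = [dict(rule.attrib) for rule in root.findall("rule")]
    return clusters, rules


clusters, rules = load_config(CONFIG)
```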
Rules

Rules define the status and the alarm level associated with monitored objects.
Rules describe the condition that must be satisfied for a monitored object to have a given status and alarm level.
Master rules are stored in the Configuration File Management Service (CFMS).
Users can create their own rules and store them locally. Users with permission can store these rules in the CFMS.
Dependency rules are a mechanism to filter out noise. For example, a batch system can be dependent on the power supply. If the power goes out on a machine, no separate alarm is raised for the batch system being down.
Alarm/Action rules define the condition that will cause an alarm to be raised or an action to be performed.
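The dependency-rule idea can be sketched as a filter over the set of failed components: a failure is suppressed when something it depends on has also failed. The rule table, the function names, and the choice to follow dependencies transitively are assumptions made for illustration, and the sketch assumes the dependency graph has no cycles.

```python
from typing import Dict, List, Set

# Hypothetical dependency rules: component -> components it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "batch_system": ["power_supply"],
    "power_supply": [],
}


def transitive_deps(component: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """All components that `component` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(depends_on.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def alarms_to_raise(failed: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Keep only failures that are not explained by a failed dependency."""
    return {c for c in failed if not (transitive_deps(c, depends_on) & failed)}
```

With these rules, a power failure that also takes down the batch system yields a single alarm for the power supply.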
Monitoring Clients

Monitoring clients will be built on an API that determines the status of each node in a hierarchy from the rules and the current information obtained from the NCS.
Monitoring clients will initiate action requests.
Monitoring clients determine the state of the system and of monitored elements based on information gathered from the NCS.
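One simple way a client could derive the status of each node in a hierarchy is to take the worst status among its children. This is a sketch of that one policy, not the NGOP client API: the status names follow the NGOP Monitor legend, but the severity ordering (in particular where "Undefined" falls) and the worst-child rule are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed ordering from least to most severe.
SEVERITY = {"Good": 0, "Undefined": 1, "Warning": 2, "Bad": 3}


@dataclass
class Node:
    name: str
    status: Optional[str] = None  # set on leaves from information reported by the NCS
    children: List["Node"] = field(default_factory=list)


def rollup(node: Node) -> str:
    """Status of a node: its own status for a leaf, else the worst child status."""
    if not node.children:
        return node.status or "Undefined"
    return max((rollup(child) for child in node.children), key=SEVERITY.__getitem__)


# Hypothetical hierarchy: one farm with two clusters.
tree = Node("farm", children=[
    Node("cluster-a", children=[Node("n1", "Good"), Node("n2", "Warning")]),
    Node("cluster-b", children=[Node("n3", "Good")]),
])
```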
Archiver/Performance Storage Service

The Archive/Performance Storage Service (PSS) is responsible for storing and retrieving messages generated by the NGOP system. These messages represent event, sensor, or action data.
Components:
  Archive Server
  Archive Retriever
  Performance Storage Subsystem (PSS)
  PSS Retriever
  Archive Database Interface
  Database (Oracle)
  DBArchiver
The PSS is simply another instance of the Archive Server.
Performance data will need to be consolidated.
NGOP Prototype

NGOP prototype development is currently underway. The prototype consists of the following modules:
NGOP Central Server
Configuration File Management Service
Monitoring Agents:
  OS Health: Monitors specific system daemons, file system existence and size, CPU load, and free memory.
  Ping Agent: Monitors node reachability.
  FBSNG Agent: Monitors the FBSNG batch system.
NGOP Client API:
  Determines the status of each monitored element based on predefined rules and current information received from the NGOP Central Server.
NGOP Monitor:
  Graphical representation of monitored element status.
  Provides a means to see and acknowledge events and alarms that have occurred.
  Provides limited configuration options.
Archive Server:
  Stores event and action messages to local disk.
  The Archive Database Interface moves the messages from local disk to an Oracle database.
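The two-stage archive path described for the prototype (write to local disk first, move to the database later) can be sketched as follows. This is not the prototype's code: the file format (one JSON object per line), the table layout, and the function names are assumptions, and an in-memory SQLite database stands in for Oracle so that the example is self-contained.

```python
import json
import os
import sqlite3
import tempfile


def spool_message(spool_path: str, message: dict) -> None:
    """Append one message to the local spool file as a JSON line."""
    with open(spool_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(message) + "\n")


def flush_spool(spool_path: str, db: sqlite3.Connection) -> int:
    """Move all spooled messages into the database; return how many were moved."""
    if not os.path.exists(spool_path):
        return 0
    with open(spool_path, encoding="utf-8") as f:
        rows = [(m["type"], m["object"], json.dumps(m)) for m in map(json.loads, f)]
    with db:  # commits on success, rolls back on error
        db.execute("CREATE TABLE IF NOT EXISTS messages (type TEXT, object TEXT, body TEXT)")
        db.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)
    os.remove(spool_path)  # remove the spool only after the commit succeeded
    return len(rows)


# Hypothetical usage with a temporary spool file and an in-memory database.
spool = os.path.join(tempfile.mkdtemp(), "spool.jsonl")
db = sqlite3.connect(":memory:")
spool_message(spool, {"type": "event", "object": "node1", "condition": "fs_full"})
spool_message(spool, {"type": "action", "object": "node1", "action": "restart_daemon"})
moved = flush_spool(spool, db)
```

Writing to local disk first means the archive step does not depend on the database being reachable at the moment a message arrives.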
NGOP Monitor

[Screenshot: NGOP Monitor display, with callouts labeled Alarm, Status, and Event description, and legend entries Bad, Warning, Good, and Undefined.]
NGOP Monitor (event acknowledgment, known-status modification…)

[Screenshot: NGOP Monitor window, with a callout labeled Monitored Element Info.]
NGOP Monitor (Configuration Options)

[Screenshot: NGOP Monitor configuration options, with callouts for default icons for known object types, default colors for status representation, and selecting elements for the top-level display.]
Summary

Building a DMS is a complex problem.
Various commercial and open-source systems were analyzed. None met the basic requirements for the NGOP project at Fermilab.
A prototype system is under development.
See http://www-isd.fnal.gov/ngop for project details.