Transcript
perfSONAR Multi-Domain Monitoring Service Deployment and Support: The LHC-OPN Use Case
Fausto Vetter, Domenico Vicinanza (DANTE)
TNC 2010, Vilnius, 2 June 2010
connect • communicate • collaborate
Agenda
Large Hadron Collider Optical Private Network (LHC-OPN)
Multi-domain monitoring challenge: perfSONAR
GÉANT Multi-Domain Monitoring Service
GÉANT Service Desk
The LHCOPN case:
– Deployment
– Support
– Monitoring
LHC-OPN
Large Hadron Collider – Optical Private Network (LHC-OPN):
Dedicated network to support the LHC experiments
Large amounts of data in a grid environment
Network architecture is organized in tiers
– 1 Tier0, 11 Tier1, 140+ Tier2
Primary users are researchers at different institutes
Requirement:
Large amount of data being exchanged
Strategy:
Keep traffic segregated from Internet
Solution:
Optical Private Network (LHC-OPN) among Tier 0/1s
Challenge:
Monitoring effectively in a multi-domain environment
LHC-OPN Topology
Dual-star topology
10 Gb/s links
Cross-border fibers
Resiliency
Multi-domain
Monitoring the LHC-OPN: The requirements
Focus of monitoring:
– Network layer (IP)
– Physical layer (links)
Regular active point-to-point measurements
– One-way delay, one-way delay variation, achievable bandwidth, historical traceroute changes
Regular passive point-to-point measurements
– Utilization, input errors, packet discards
End-to-end link monitoring
Managed service
Unified view of the network status and information across all sites
Homogeneous installations and centralized operations
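To make the active metrics concrete, here is a minimal sketch of how one-way delay and one-way delay variation could be derived from probe timestamps. It assumes each probe packet has a send and a receive timestamp taken from synchronized clocks, and it takes delay variation as the difference between consecutive one-way delays, which is one common definition. The function names and sample values are hypothetical and are not taken from perfSONAR or HADES.

```python
from typing import List, Tuple


def one_way_delays(samples: List[Tuple[float, float]]) -> List[float]:
    """Return receive_time - send_time (seconds) for each (send, receive) pair."""
    return [recv - send for send, recv in samples]


def delay_variation(delays: List[float]) -> List[float]:
    """Return the difference between each delay and the one before it."""
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


# Hypothetical probe timestamps in seconds: (sent at source, received at destination).
samples = [(0.0, 0.25), (1.0, 1.5), (2.0, 2.25)]

delays = one_way_delays(samples)
variation = delay_variation(delays)
```

Because the delay is computed from two different hosts' clocks, any clock offset between them shows up directly in the result, which is why the sites are asked to provide time sources.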
Monitoring the LHC-OPN: The solution - perfSONAR
The Tool:
perfSONAR GÉANT multi-domain monitoring (MDM) tool
Based on the Open Grid Forum standard monitoring protocol
Customized, fully managed and supported for LHCOPN
Objective:
Identify network problems across multiple domains
– Correctly, efficiently and quickly
– Allowing proactive actions
Strategy:
Perform network monitoring actions in different network domains
Make the information available through a common protocol
– Cross-domain monitoring capability
– Access to network performance metrics from across multiple domains
perfSONAR as unifying layer across domains
[Figure: Domains 1 to 4, each with its own local monitoring, joined by a layer of perfSONAR services; the data is consumed through the perfSONAR UI (visualization) and through scripts/API.]
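The unifying-layer idea can be sketched as a set of adapters: each domain keeps its own local monitoring format, and a thin adapter converts it into one common record type that a single interface can query. Everything below (the record fields, the two native formats, the sample values) is a hypothetical illustration and does not reproduce the perfSONAR protocol or API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Measurement:
    """Common record shared by all domains (hypothetical schema)."""
    domain: str
    link: str
    metric: str
    value: float


def from_domain1(raw: Dict[str, float]) -> List[Measurement]:
    # Assumed native format: utilisation as a fraction, keyed by link name.
    return [Measurement("domain1", link, "utilisation_pct", frac * 100.0)
            for link, frac in raw.items()]


def from_domain2(raw: List[Tuple[str, int]]) -> List[Measurement]:
    # Assumed native format: (link name, utilisation in percent) tuples.
    return [Measurement("domain2", link, "utilisation_pct", float(pct))
            for link, pct in raw]


def unified_view(sources: List[Tuple[Callable[[Any], List[Measurement]], Any]]) -> List[Measurement]:
    """Run every domain's adapter and merge the results into one sorted list."""
    merged: List[Measurement] = []
    for adapter, raw in sources:
        merged.extend(adapter(raw))
    return sorted(merged, key=lambda m: (m.domain, m.link))


view = unified_view([
    (from_domain2, [("link-b", 80)]),
    (from_domain1, {"link-a": 0.5}),
])

# One query now spans all domains, regardless of how each one stores its data.
busy_links = [m.link for m in view if m.value > 75.0]
```

The design choice illustrated here is that domains do not have to change their local tools; only the adapter has to agree on the common format.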
Monitoring the LHC-OPN: The benefits
Effective monitoring across the several LHC-OPN domains
perfSONAR enables multi-domain monitoring
– Problems can be tracked through the participating domains from a single interface
… proactively solving problems across domains
– Effective, distributed monitoring can identify problems even before users are affected
… through a customized web portal
– Monitoring portal designed according to LHCOPN needs
– Easy to integrate into the workflows of the NOCs involved
– Fewer disruptions and faster recovery
– Easy to adopt and fosters collaborative efforts
Fully managed solution:
– Low overhead for the Tier0/1 network operators involved
– Configuration, operation and support carried out by the GÉANT Service Desk (SD)
perfSONAR at LHC-OPN
12 sites involved (1 Tier0, CERN, and 11 Tier1)
Several countries across Europe, Asia and America
Access to network measurement data from multiple network domains
Customized version of the perfSONAR MDM service for Tier0/1 sites (main contributors to LHCOPN operations)
Customized visualization tool accessed through:
– Dedicated web portal
– Specific weather maps and further diagnosis tools to visualize measurement results
Monitoring tools, hardware and operating system packed in monitoring boxes:
– To be easily deployed at any location
– Remotely accessible by the service desk for operations and support
GÉANT MDM Service Design for LHCOPN
Two servers installed at each site (Tier0 and Tier1):
Server 1 (HADES):
– One-way delay, one-way delay variation, achievable bandwidth, historical traceroute changes
Server 2 (MDM):
– Regular passive measurements collecting interface utilisation, input error and packet discard statistics from the sites' network elements
Each site provided:
– Gigabit port on the border router
– Switch
– Time sources
– DNS servers
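Interface utilisation is usually derived from octet counters read at two points in time. The sketch below shows that arithmetic, assuming a counter that wraps around at a fixed bit width and a known link speed; the function name, parameters and sample values are hypothetical and do not describe how the MDM server itself is implemented.

```python
def utilisation_percent(octets_then: int, octets_now: int, interval_s: float,
                        link_speed_bps: float, counter_bits: int = 64) -> float:
    """Return average link utilisation in percent between two counter readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modular subtraction gives the right delta even if the counter wrapped once.
    delta_octets = (octets_now - octets_then) % (1 << counter_bits)
    bits_per_second = delta_octets * 8 / interval_s
    return 100.0 * bits_per_second / link_speed_bps


# Hypothetical readings taken 300 s apart on a 10 Gb/s link.
example = utilisation_percent(0, 187_500_000_000, 300.0, 10e9)
```

If more than one wrap can occur between polls, the delta becomes ambiguous, so the polling interval has to be short relative to the counter width and link speed.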
perfSONAR MDM in LHC-OPN
[Figure: perfSONAR MDM deployment in LHC-OPN. The Tier-0 (CERN-CH) and each Tier-1 site (CNAF-IT, GRIDKA-DE, IN2P3-FR, RAL-UK, SARA-NL, NDGF-DK, PIC-ES, TRIUMF-CN, FNAL-US, BNL-US, ASGC-KR) are drawn with an LHC-1 (HADES) server, an LHC-2 (MDM) server and an L2 MP. Further labels: HADES Central Server; Management Network; Monitoring Devices; RHEL; Network Application Repositories; Visualization Network; CFEngine System; CNM; perfSONAR-UI; Visualization Tools; OWAMP, Pinger, BWCTL; Tier 2 OWAMP, Pinger & BWCTL Probes.]
The result as displayed by the LHC-OPN Portal
Weather-map E2Emon Link Status
GÉANT Application Service Desk
Deployment carried out by the GÉANT Application Service Desk
Dedicated staff
Manages the user relationship
Responsible for incident management
Interacts with Problem Management/Product Management to improve products
Acts as a single point of contact for:
– Usage of products
– Deployment of products
– Debugging issues on products
Focus on transition and operations of the services delivered
GÉANT MDM Service Transition
Service deployment: two workflows
– Server 1: OS and software installed and configured by a GÉANT partner
– Server 2: OS and software entirely installed and configured remotely
Phase details:
Pre-Shipment:
– Pre-Shipment Form: gather information about deployment details
– Shipment: servers shipped to GÉANT partner and customer
– Receive Boxes: customer and configuration partner receive boxes
Preparation:
– Pre-Deployment Form
– Third party supplier prepares servers
– Physical Installation
Deployment:
software installation
Configuration:
service configuration
Validation
MDM Service Deployment Agenda
perfSONAR services monitoring
Service Monitoring Infrastructure (based on Nagios + Cacti):
– Customised set of testing scripts and health checks
– 35 checks per server, covering hardware, software and services
– Automatic notification, detailed history
Three-layer monitoring:
– Hardware layer: CPU, memory, disk space, network interfaces, TCP/UDP traffic, temperature
– Resource layer: login attempts, Tomcat RTT, eXist RTT, MySQL, NTP
– Service layer: perfSONAR services availability and performance
Additional tools:
– Syslog server (with MySQL support)
– Security log auditing (with automatic email report tools)
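As an illustration of what one hardware-layer health check might look like, the sketch below follows the standard Nagios plugin convention of returning 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN) together with a one-line status message. The disk-space check, its thresholds and the function names are assumptions chosen for the example; the actual 35 checks used by the service desk are not described in the slides.

```python
import shutil
from typing import Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct: float, warn_pct: float, crit_pct: float) -> Tuple[int, str]:
    """Map a disk usage percentage to a (return code, status line) pair."""
    if used_pct >= crit_pct:
        code = CRITICAL
    elif used_pct >= warn_pct:
        code = WARNING
    else:
        code = OK
    return code, f"DISK {LABELS[code]} - {used_pct:.1f}% used"


def check_disk(path: str = "/", warn_pct: float = 80.0,
               crit_pct: float = 90.0) -> Tuple[int, str]:
    """Measure disk usage at `path`; report UNKNOWN if it cannot be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return evaluate(100.0 * usage.used / usage.total, warn_pct, crit_pct)
```

A real plugin would print the status line and exit with the return code, which is how Nagios decides whether to send a notification.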
GÉANT MDM Service Operations: the monitoring interfaces
GÉANT MDM Service Operations: incident management procedures
Well-defined procedures for Incident Management:
[Figure: incident management flowcharts. Actors: Reporter, Service Desk (Ticket Owner), Object1 Responsible, Third Party Supplier. Steps shown: Raise Issue, Open Ticket, Notify Reporter, Assign Responsible, Identify Ticket, Require Information, Handle Ticket, Handle Incident/Information, Report Problem Solution, Document Solution, Close Ticket, Notify Ticket Closing. A second flowchart covers the case where a third party supplier is involved.]
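A ticket lifecycle like the one in the incident management procedure can be modelled as a small state machine. The sketch below is only an illustration: the state names are derived from the step labels on the slide, while the exact transition order and the points at which the reporter is notified are assumptions, since the flowchart arrows are not recoverable from the transcript.

```python
from enum import Enum
from typing import List


class TicketState(Enum):
    RAISED = "raised"
    OPEN = "open"
    ASSIGNED = "assigned"
    WAITING_FOR_INFO = "waiting_for_info"
    IN_PROGRESS = "in_progress"
    CLOSED = "closed"


# Assumed transitions; a closed ticket cannot move anywhere.
ALLOWED = {
    TicketState.RAISED: {TicketState.OPEN},
    TicketState.OPEN: {TicketState.ASSIGNED},
    TicketState.ASSIGNED: {TicketState.WAITING_FOR_INFO, TicketState.IN_PROGRESS},
    TicketState.WAITING_FOR_INFO: {TicketState.IN_PROGRESS},
    TicketState.IN_PROGRESS: {TicketState.WAITING_FOR_INFO, TicketState.CLOSED},
    TicketState.CLOSED: set(),
}


class Ticket:
    def __init__(self, reporter: str) -> None:
        self.reporter = reporter
        self.state = TicketState.RAISED
        self.notifications: List[str] = []

    def move_to(self, new_state: TicketState) -> None:
        """Apply a transition, rejecting any that the procedure does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state
        # Assumption: the reporter is notified when the ticket opens and closes.
        if new_state in (TicketState.OPEN, TicketState.CLOSED):
            self.notifications.append(f"notify {self.reporter}: ticket {new_state.value}")


ticket = Ticket("reporter@example.org")
for step in (TicketState.OPEN, TicketState.ASSIGNED,
             TicketState.IN_PROGRESS, TicketState.CLOSED):
    ticket.move_to(step)
```

Encoding the allowed transitions in one table keeps the procedure auditable: any step that skips assignment or reopens a closed ticket fails loudly instead of silently changing state.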
Conclusions
GÉANT Application Service Desk:
– Effective single point of contact in complex deployments
LHC-OPN use case:
– Great opportunity for the service and support infrastructure
Reasons for a successful deployment:
– Preparation phase is crucial
– Adequate tools for event and incident management
– Customer collaboration played the main role in the deployment
Continuous service improvement:
– Periodic meetings with involved parties
– Quality audits of the deployment
Final Remarks
Thanks to:
– perfSONAR community
– GÉANT partners
– DANTE
– perfSONAR development team
– CERN and its partners
Thanks for your attention
Any questions and/or comments?