perfSONAR Multi-Domain Monitoring Service Deployment and Support: The LHC-OPN Use Case

Fausto Vetter, Domenico Vicinanza (DANTE)
TNC 2010, Vilnius, 2 June 2010

connect • communicate • collaborate

Agenda

Large Hadron Collider Optical Private Network (LHC-OPN)
Multi-domain monitoring challenge: perfSONAR
GÉANT Multi-Domain Monitoring Service
GÉANT Service Desk
The LHCOPN case:
– Deployment
– Support
– Monitoring

LHC-OPN

Large Hadron Collider – Optical Private Network (LHC-OPN):
– Dedicated network to support the LHC experiment
– Large amount of data in a grid environment
– Network architecture is organized in tiers: 1 Tier0, 11 Tier1, 140+ Tier2
– Primary users are researchers at different institutes

Requirement: large amount of data being exchanged

Strategy: keep traffic segregated from the Internet

Solution: Optical Private Network (LHC-OPN) among Tier0/1 sites

Challenge: monitoring effectively in a multi-domain environment

LHC-OPN Topology

Dual-star topology
10 Gb/s links
Cross-border fiber resiliency
Multi-domain

Monitoring the LHC-OPN: The requirements

Focus of monitoring:
– Network Layer (IP)
– Physical Layer (Links)
Regular active point-to-point measurements:
– One-Way Delay, One-Way Delay Variation, Achievable Bandwidth, Historical Traceroute Changes
Regular passive point-to-point measurements:
– Utilization, Input Errors, Packet Discards
End-to-end link monitoring
Managed service:
– Unified view of the network status and information across all sites
– Homogeneous installations and centralized operations
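To make the active metrics above concrete, here is a minimal sketch of how one-way delay and one-way delay variation could be derived from timestamped test packets. It assumes sender and receiver clocks are synchronized, and it takes delay variation as the difference between consecutive one-way delays, which is one common definition. The `Probe` class and the function names are hypothetical and are not part of the perfSONAR tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Probe:
    """One active test packet; timestamps in seconds from synchronized clocks."""
    sent_at: float      # timestamp taken by the sender
    received_at: float  # timestamp taken by the receiver


def one_way_delays(probes: List[Probe]) -> List[float]:
    """One-way delay of each probe: receive time minus send time."""
    return [p.received_at - p.sent_at for p in probes]


def delay_variations(delays: List[float]) -> List[float]:
    """Delay variation as the difference between consecutive one-way delays."""
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


# Made-up sample: three probes sent one second apart.
probes = [Probe(0.0, 0.012), Probe(1.0, 1.015), Probe(2.0, 2.011)]
delays = one_way_delays(probes)
variations = delay_variations(delays)
```

With these made-up timestamps the delays are about 12 ms, 15 ms and 11 ms, so the variation series has one positive and one negative step. Any clock offset between the two hosts adds directly to the measured delay, so this approach depends on reliable time sources at each site.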

Monitoring the LHC-OPN: The solution - perfSONAR

The Tool:
– perfSONAR GÉANT multi-domain monitoring (MDM) tool
– Based on the Open Grid Forum standard monitoring protocol
– Customized, fully managed and supported for LHCOPN

Objective:
– Identify network problems across multiple domains
– Correctly, efficiently and quickly
– Allowing proactive actions

Strategy:
– Perform network monitoring actions in different network domains
– Make the information available thanks to a common protocol (cross-domain monitoring capability)
– Access network performance metrics from across multiple domains
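The strategy above can be illustrated with a small sketch: each domain keeps its own local monitoring but answers through one shared interface, so a client can follow a path across domains with the same call. This is only an illustration of the idea, not the perfSONAR protocol; `DomainService`, `StaticDomainService`, the domain and link names, and all readings are hypothetical.

```python
from typing import Dict, List, Protocol, Tuple


class DomainService(Protocol):
    """Hypothetical common interface exposed by every domain."""

    domain: str

    def utilisation(self, link: str) -> float:
        """Utilisation of one link in this domain, in percent."""
        ...


class StaticDomainService:
    """Stand-in for a domain's local monitoring, backed by fixed readings."""

    def __init__(self, domain: str, readings: Dict[str, float]) -> None:
        self.domain = domain
        self._readings = readings

    def utilisation(self, link: str) -> float:
        return self._readings[link]


Sample = Tuple[str, str, float]  # (domain, link, utilisation in percent)


def collect_path(path: List[Tuple[DomainService, str]]) -> List[Sample]:
    """Query each domain along a path through the same interface."""
    return [(service.domain, link, service.utilisation(link))
            for service, link in path]


def busiest_segment(samples: List[Sample]) -> Sample:
    """Return the sample with the highest utilisation."""
    return max(samples, key=lambda sample: sample[2])


# Made-up readings for three domains along one path.
path = [
    (StaticDomainService("domain-1", {"link-a": 12.5}), "link-a"),
    (StaticDomainService("domain-2", {"link-b": 71.0}), "link-b"),
    (StaticDomainService("domain-3", {"link-c": 33.0}), "link-c"),
]
samples = collect_path(path)
worst = busiest_segment(samples)
```

Because the client only depends on the shared interface, a domain can change its local tooling without affecting how the cross-domain view is built.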

perfSONAR as unifying layer across domains

[Diagram: Domain 1, Domain 2, Domain 3 and Domain 4, where each domain has its own local monitoring. perfSONAR services sit across the domains; the perfSONAR UI (visualization) and scripts/API access them.]

Monitoring the LHC-OPN: The benefits

Effective monitoring across the several LHC-OPN domains

perfSONAR enables multi-domain monitoring

Problems can be tracked through the participating domains from a single interface

…proactively solving problems across domains

Effective, distributed monitoring can identify problems even before users suffer them

… through a customized web portal

Monitoring portal designed according to LHCOPN needs

Easy to integrate into the workflows of the NOCs involved

Fewer disruptions and faster recovery

Easy to undertake and foster collaborative efforts

Fully managed solution:
– Low overhead for the Tier0/1 network operators involved
– Configuration, operation and support carried out by GÉANT SD

perfSONAR at LHC-OPN

12 sites involved (1 Tier0, CERN, and 11 Tier1)
Several countries across Europe, Asia and America
Access to network measurement data from multiple network domains
Customized version of the perfSONAR MDM service for Tier0/1 sites (main contributor to LHCOPN operations)
Customized visualization tool, accessed through:
– Dedicated web portal
– Specific weather maps and further diagnosis tools to visualize measurement results
Monitoring tools, hardware and operating system packed in monitoring boxes:
– To be easily deployed at any location
– Remotely accessible by the service desk for operations and support

GÉANT MDM Service Design for LHCOPN

Two servers installed at each site (Tier0 and Tier1):
– Server 1 (HADES): one-way delay, one-way delay variation, achievable bandwidth, historical traceroute changes
– Server 2 (MDM): regular passive measurements collecting interface utilisation, input error and packet discard statistics from the site's network elements

Each site provided:
– Gigabit port on the border router
– Switch
– Time sources
– DNS servers
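As an illustration of the passive measurements collected by Server 2, the sketch below derives average interface utilisation from two readings of an octet counter. It assumes the network element exposes a monotonically increasing octet counter that wraps at a fixed width (for example a 64-bit counter read over SNMP); the function name and parameters are hypothetical, and the sample figures are made up.

```python
def utilisation_percent(octets_start: int, octets_end: int,
                        interval_s: float, link_bps: float,
                        counter_bits: int = 64) -> float:
    """Average link utilisation between two octet-counter samples, in percent."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    # Counters wrap at 2**counter_bits; the modulo handles a single wrap.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_per_second = delta_octets * 8 / interval_s
    return 100.0 * bits_per_second / link_bps


# Made-up sample: 37.5 GB counted in 60 s on a 10 Gb/s link.
usage = utilisation_percent(0, 37_500_000_000, 60.0, 10e9)
```

The sample works out to 5 Gb/s on average, or 50% of a 10 Gb/s link. If the counter can wrap more than once between two readings the result is too low, so the polling interval has to be short enough for the counter width.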

perfSONAR MDM in LHC-OPN

[Diagram: perfSONAR MDM deployment in LHC-OPN. Sites are shown with LHC-1 (HADES), LHC-2 (MDM) and L2 MP boxes.
Tier-0: CERN-CH
Tier-1: CNAF-IT, GRIDKA-DE, IN2P3-FR, RAL-UK, SARA-NL, NDGF-DK, PIC-ES, TRIUMF-CN, FNAL-US, BNL-US, ASGC-KR
Other elements shown: HADES Central Server, Management Network, Monitoring Devices, RHEL, Network Application Repositories, Visualization Network, CFEngine System, CNM, perfSONAR-UI, Visualization Tools, OWAMP, Pinger, BWCTL, and Tier 2 OWAMP, Pinger & BWCTL probes.]

The result as displayed by the LHC-OPN Portal

Weather-map E2Emon Link Status

GÉANT Application Service Desk

Deployment carried out by the GÉANT Application Service Desk
Dedicated staff
Manages the user relationship
Responsible for Incident Management
Interacts with Problem Management/Product Management to improve products
Acts as a Single Point of Contact for:
– Usage of products
– Deployment of products
– Debugging issues on products
Focus on transition and operations of the services delivered

GÉANT MDM Service Transition

Service deployment: two workflows
– Server 1: OS and software installed and configured by a GÉANT partner
– Server 2: OS and software entirely installed and configured remotely

Phase details:

Pre-Shipment:
– Pre-Shipment Form: gather information about deployment details
– Shipment: servers shipped to GÉANT partner and customer
– Receive Boxes: customer and configuration partner receive boxes

Preparation:
– Pre-Deployment Form
– Third party supplier prepares servers
– Physical Installation

Deployment: software installation

Configuration: service configuration

Validation

MDM Service Deployment Agenda

perfSONAR services monitoring

Service Monitoring Infrastructure (based on Nagios+Cacti):
– Customised set of testing scripts and health checks
– 35 checks per server, covering hardware, software and services
– Automatic notification, detailed history

Three-layer monitoring:
– Hardware layer: CPU, memory, disk space, network interfaces, TCP/UDP traffic, temperature
– Resource layer: login attempts, Tomcat RTT, eXist RTT, MySQL, NTP
– Service layer: perfSONAR services availability and performance

Additional tools:
– Syslog server (with MySQL support)
– Security log auditing (with automatic email report tools)
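As an illustration of what one hardware-layer health check might look like, here is a minimal sketch of a disk space check that maps its result to the exit codes of the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is not one of the 35 checks used in the deployment; the function names and the 80% and 90% thresholds are assumptions chosen for the example.

```python
import shutil
from typing import Tuple

# Exit codes of the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_disk_usage(used_percent: float,
                        warn_at: float = 80.0,
                        crit_at: float = 90.0) -> int:
    """Map a usage percentage to a status code (thresholds are assumptions)."""
    if used_percent >= crit_at:
        return CRITICAL
    if used_percent >= warn_at:
        return WARNING
    return OK


def check_disk(path: str = "/") -> Tuple[int, str]:
    """Return (status code, one-line status message) for one filesystem."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The path is missing or unreadable, so the state cannot be determined.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_percent = 100.0 * usage.used / usage.total
    status = classify_disk_usage(used_percent)
    return status, f"DISK {LABELS[status]} - {used_percent:.1f}% used on {path}"
```

A script run by the monitoring server would print the message and exit with the status code, for example `status, message = check_disk("/"); print(message); sys.exit(status)`, which lets the scheduler trigger notifications and keep a history per check.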

GÉANT MDM Service Operations: the monitoring interfaces

GÉANT MDM Service Operations: incident management procedures

Well-defined procedures for Incident Management:
[Flowchart: incident management workflow, with a variant where a third party supplier is involved.
Roles: Reporter, Service Desk (Ticket Owner), Responsible, Third Party Supplier.
Steps shown: Raise Issue, Open Ticket, Assign Responsible, Require Information, Identify Ticket, Handle Ticket, Handle Incident/Information, Report Problem Solution, Document Solution, Close Ticket, Notify Ticket Closing, Notify Reporter.]

Conclusions

GÉANT Application Service Desk: effective single point of contact in complex deployments
LHC-OPN use case: great opportunity for service & support infrastructure
Reasons for a successful deployment:
– Preparation phase is crucial
– Adequate tools for event and incident management
– Customer collaboration played the main role in the deployment

Continuous service improvement:
– Periodic meetings with involved parties
– Quality audits of the deployment

Final Remarks

Thanks to:
– perfSONAR community
– GÉANT partners
– DANTE perfSONAR development team
– CERN and its partners

Thanks for your attention
Any questions and/or comments?
