A three-year thorough review of the ENOC

Enabling Grids for E-sciencE
A three-year thorough review of a project’s NOC: the EGEE Network Operating Centre (ENOC)
Guillaume Cessieux (CNRS/IN2P3-CC, EGEE-SA2)
Xavier Jeannin (CNRS/UREC, EGEE-SA2)
TNC 2009, Málaga, 2009-06-08
www.eu-egee.org
EGEE-III INFSO-RI-222667
EGEE and gLite are registered trademarks
Outline
• EGEE in a very small nutshell
– Overview
– Network involved
– Networking support activity in EGEE
• ENOC
– Concept
– History
– Detailed implementation
• Achievements & review
• Current areas of work
The EGEE project
• Enabling Grids for E-sciencE (EGEE)
– The largest multi-disciplinary Grid infrastructure in the world
– Brings together more than 140 institutions
– Produces a reliable and scalable computing resource
LHC
~ 300 sites (50 countries)
> 80,000 CPU cores
> 20 petabytes
> 14,000 users
> 370,000 jobs/day
• Networks are the key underlying layer
Networks involved
• Multi-domain shared network infrastructure
– Delivered by more than 30 NRENs & GÉANT2
– Including non-European networks (CA, RU, SN, TW, US...)
• A dedicated network: the LHCOPN
– Large Hadron Collider Optical Private Network
– 10 Gb/s lightpaths terminating at sites
EGEE networking support
• Networking support: Support Activity 2 - SA2
– “Small” activity (~ 1.5% of overall project’s budget, ~ 7 FTEs)
– Provide a single interface between Grid and networks
[Figure: the ENOC (SA2) as the interface between the Grid and the networks]
The ENOC
• Why a project’s NOC?
– Embed network operations in Grid operations at project level
 Scheduled network downtimes, incident reports, bandwidth issues...
– Single convenient operational interface between networks & Grid
• ENOC: EGEE Network Operation Centre
– Including all its required dependencies
 Monitoring, troubleshooting, operational tools...
• But a very particular “NOC”
– The EGEE project neither owned nor managed any network devices...
– More a “workflow facilitator”
 Manpower allocated ~ 2 FTEs - not 24x7x365
History
• 2004-2006 – EGEE-I
– Survey and feasibility investigation
– Processes defined, prototyped and validated
• 2006-2008 – EGEE-II
– First raw implementation from scratch
 Topology database, ticket handling and analysis...
• 2008-2010 – EGEE-III
– SA2 now focused around the ENOC
 Particularly on tools (troubleshooting, ticket exchange...)
 Maturing processes and tools with lessons learnt from EGEE-II
Foreseen operational process
[Figure: foreseen operational workflow with steps numbered 1 to 3. Labels, from the networks up to the Grid: NOC A, NOC B, NREN A, NREN B, NREN C, Site A, Site B, ENOC, GGUS (Grid TTS), Grid operations, Users.]
Tools - Overview
[Figure: overview of the tool chain. Inputs: 19 NRENs + GÉANT2, ~ 2500 e-mails/month, ~ 800 tickets/month. Processing: translate, homogenize, sort; impact assessment, filtering. Components: network topology DB, trouble tickets DB, GGUS, internal tools, maps, statistics, dashboard, public dashboard, sharing.]
Tools - Tickets handling (1/2)
• Tickets are the only operational information widely available from network providers
• Ticket homogeniser
– One template per NREN defines the matching criteria
 Regexp based: match location, start date, ticket ID, etc.
See related poster at TNC 2009: "Grid Management: Architecture analysis of a trouble ticket normalization and delivery service"
[Figure: example templates for RedIRIS and GRNET]
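The regexp-based templates described above can be sketched as follows. This is only an illustration, not the ENOC homogeniser itself: the sender domains, field labels, date formats and the `normalise` helper are all assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical per-NREN templates: one regexp with one capture group per field.
TEMPLATES = {
    "nren-a.example": {
        "ticket_id": re.compile(r"^Ticket:\s*(\S+)", re.MULTILINE),
        "location": re.compile(r"^Location:\s*(.+)$", re.MULTILINE),
        "start": re.compile(r"^Start:\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2})", re.MULTILINE),
    },
    "nren-b.example": {
        "ticket_id": re.compile(r"\[TT#(\d+)\]"),
        "location": re.compile(r"Affected PoP:\s*(.+)$", re.MULTILINE),
        "start": re.compile(r"Begin \(UTC\):\s*(\d{2}/\d{2}/\d{4} \d{2}:\d{2})"),
    },
}

# Hypothetical date layout used by each provider.
DATE_FORMATS = {
    "nren-a.example": "%Y-%m-%d %H:%M",
    "nren-b.example": "%d/%m/%Y %H:%M",
}


def normalise(sender_domain, body):
    """Return a homogenised ticket dict, or None if the e-mail cannot be matched."""
    template = TEMPLATES.get(sender_domain)
    if template is None:
        return None  # no template for this provider
    ticket = {"source": sender_domain}
    for field, pattern in template.items():
        match = pattern.search(body)
        if match is None:
            return None  # incomplete match: the e-mail goes to the trash
        ticket[field] = match.group(1).strip()
    # Convert the provider-specific date string into a common representation.
    ticket["start"] = datetime.strptime(ticket["start"], DATE_FORMATS[sender_domain])
    return ticket


mail_a = "Ticket: A-1234\nLocation: Madrid\nStart: 2009-06-08 09:30\nFibre cut."
mail_b = "Subject: [TT#77] maintenance\nAffected PoP: Athens\nBegin (UTC): 08/06/2009 22:00\n"
ticket_a = normalise("nren-a.example", mail_a)
ticket_b = normalise("nren-b.example", mail_b)
```

Keeping one template per provider means a change in one NREN's e-mail layout only requires updating that provider's regexps.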
Tools - Tickets handling (2/2)
• Now successfully handling a large workflow
– For 19 NRENs: ~ 2500 e-mails/month representing ~ 800 tickets
 Very low trash ratio: ~ 5%
 ~ 200 tickets open at the same time
– In our database: 102k e-mails, 27k tickets, 1 GB of data
Tools – Topology database (1/2)
• Provide a logical view of the network
– Avoid going to too low a level
 Network providers did not want to expose their topology
 Too much detail would be useless and might be a burden to maintain
• Schema was hard to define
– Another too complex...
Tools – Topology database (2/2)
• Initially filled automatically from traceroute analysis (~ DNS domain name matching)
– Then manually reviewed using graphical tools
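The DNS domain name matching mentioned above might look like the sketch below. The suffix table, the hop names and both helper functions are hypothetical; the slides do not describe the actual analysis in more detail.

```python
# Hypothetical mapping from DNS suffix to network provider.
DOMAIN_TO_NETWORK = {
    "nren-a.example": "NREN A",
    "backbone.example": "Backbone",
    "nren-b.example": "NREN B",
}


def network_of(hostname):
    """Return the provider whose DNS suffix matches the hop name, else None."""
    name = hostname.lower().rstrip(".")
    for suffix, network in DOMAIN_TO_NETWORK.items():
        if name == suffix or name.endswith("." + suffix):
            return network
    return None


def logical_path(hop_names):
    """Collapse traceroute hop names into an ordered list of traversed networks."""
    path = []
    for hop in hop_names:
        network = network_of(hop)
        # Skip unresolved hops and merge consecutive hops of the same network.
        if network is not None and (not path or path[-1] != network):
            path.append(network)
    return path


hops = [
    "gw.site-a.example",
    "r1.nren-a.example",
    "r2.nren-a.example",
    "core.backbone.example",
    "r9.nren-b.example",
    "*",
]
path = logical_path(hops)
```

Hops that match no known suffix are simply ignored here, which is one reason a manual review step remains useful.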
Tools – Impact computation
• Really the tricky part
– How to filter from all tickets received those impacting the Grid?
• Automatic impact computation attempted
– Match ticket’s locations against our adapted topology database
 If a node is affected, assume all linked sites are too
– Store impact and map ticket on node
– Suspected ratio of tickets impacting the Grid: ~ 15%
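A minimal sketch of this impact rule, assuming the logical topology is stored as a mapping from network node to the Grid sites connected through it. The node names, site names and function names are invented for the example.

```python
# Hypothetical logical topology: network node -> Grid sites linked to it.
TOPOLOGY = {
    "nren-a/pop-1": {"Site A"},
    "nren-a/pop-2": {"Site B", "Site C"},
    "nren-b/pop-1": set(),  # known node with no Grid site behind it
}


def impacted_sites(ticket_locations, topology=TOPOLOGY):
    """If a node is affected, assume every site linked to it is affected too."""
    sites = set()
    for location in ticket_locations:
        sites |= topology.get(location, set())
    return sites


def impact_ratio(tickets, topology=TOPOLOGY):
    """Fraction of tickets whose locations map to at least one Grid site."""
    if not tickets:
        return 0.0
    impacting = sum(1 for locations in tickets if impacted_sites(locations, topology))
    return impacting / len(tickets)


# Each ticket is reduced to the list of locations extracted from it.
tickets = [
    ["nren-a/pop-1"],
    ["nren-b/pop-1"],
    ["unknown-location"],
    ["nren-a/pop-2", "nren-b/pop-1"],
]
ratio = impact_ratio(tickets)
```

The rule is deliberately pessimistic: it cannot tell whether a site has a backup path, so it can only flag suspected impact.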
Tools – Operational database
• Topology database + operational information = operational database
– Stores the network outages that are impacting the Grid
Tools - Monitoring
• Connectivity tests: home-made DownCollector
– TCP tests on all Grid nodes (~ 2000) from a central point
 Aggregated results per site
– Impact localisation using stored network checkpoints
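The idea of TCP tests run from a central point and aggregated per site can be illustrated as below. This is not DownCollector's code: the function names, the timeout and the node list layout are assumptions for the sketch.

```python
import socket
from collections import defaultdict


def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def site_status(nodes, check=tcp_check):
    """nodes: iterable of (site, host, port). Returns site -> fraction of reachable nodes."""
    reachable = defaultdict(int)
    total = defaultdict(int)
    for site, host, port in nodes:
        total[site] += 1
        if check(host, port):
            reachable[site] += 1
    return {site: reachable[site] / total[site] for site in total}
```

Passing the check function as a parameter lets the aggregation be exercised without touching the network, for example `site_status(nodes, check=lambda host, port: host != "h2")`.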
2008 breakdown from DownCollector
Average assessment from DownCollector for 2008 on EGEE certified Grid sites (~ 300):
• Network troubles are not concentrated on a few sites
• More than half of the connectivity problems detected are on-site
• 80% of off-site network troubles are solved within 30 minutes
• Only ~ 45/month last longer
Achievements around the ENOC
• DownCollector - https://ccenoc.in2p3.fr/DownCollector/
– Reached 3 GB of monthly traffic (web + Nagios querying)
• ASPDrawer doing BGP monitoring of LHCOPN
– Useful service assessment for 2008 and official for 2009
• Trouble ticket exchange standard
– Work around database (topology, tickets, impacts)
– Normalisation of network trouble tickets ready to be implemented
– Rendering on web interfaces
• Approaches strongly driven by automation
– Reasonable efforts to run and maintain things!
Review (1/3)
• Information in network trouble tickets is not formalised or accurate enough
– Plain text e-mail tickets are a plague to analyse
– Even when matched correctly, the content is often unsatisfactory
 Impact on services not computed
 Only targeted at the local community, meaningless at project level
 Naming conventions linked to a topology database somewhere?
• This really prevents reliable automatic impact assessment
– And it is hard to get network providers to improve that…
 What about homogenising at least interfaces within NRENs?
Review (2/3)
• Disclosure of network trouble tickets is a big issue
– How can they be shared?
 What about a centralised knowledge database of network issues?
• Few or misdirected enquiries from the Grid
– Middleware still outputs some very misleading error messages
• Lack of a forum for global exchange with NRENs
– EGEE Technical Network Liaison Committee (TNLC) was set up
 But attendance often reduced to EGEE partners...
Review (3/3)
• Lack of serious network monitoring is a real problem
– Technical complexity due to the scale and... viewpoints
– NOC not fed with data, no history, no quality assessment...
 Connectivity tests good but not enough
– Good convergence toward perfSONAR solutions
 Some extra time needed to mature and be widely deployed
• Networks are really working fine
– This was not expected to such an extent
Current areas of work (1/2)
• e2e troubleshooting service: perfSONAR lite TSS
– perfSONAR PS based with central web interface
– On-demand measurements only
• Standard trouble ticket exchange
– Data models and software are now here
 RFC draft was submitted (2009-05)
– But what is the benefit for NRENs of delivering standard trouble tickets?
 This might really slow down adoption…
• Trouble ticket impact matching
– Correlate tickets with monitoring data
[Figure: ticket and monitoring correlation architecture. Labels: node state reports; ENOC helpdesk services; active monitoring & alert processing; problem notifications & tickets; ticket preprocessing; node state store; metaticket store; statistical matching engine (event correlator); history correlation matrix; topology database (NOD); GEANT/NRENs community. Numbered items: (1) alert notifications, (2) resulting node status, (3) the matcher output.]
[Figure: node state counter over time, with states Fine, Moderate and Bad. Normal, warning and critical zones are separated by a normal/warning hold zone and a warning/critical hold zone. Transitions are labelled rising to warning, rising to critical and resuming to normal.]
Example log from the figure:
Dec 7 16:34:06 loss-moderate (rising)
Dec 7 16:39:07 loss-bad
Dec 7 18:04:07 loss-moderate (resuming)
Dec 7 18:09:08 loss-fine
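The hold zones between the normal, warning and critical zones suggest a hysteresis scheme: a node state only changes once the measurement has fully crossed a hold zone, which avoids flapping around a single threshold. The sketch below illustrates that general technique. The thresholds, the loss metric in percent and the function names are assumptions, as the slides give no numeric values.

```python
# Assumed packet loss thresholds in percent (not taken from the slides).
WARNING_ENTER, WARNING_EXIT = 2.0, 1.0     # normal/warning hold zone: 1.0 to 2.0
CRITICAL_ENTER, CRITICAL_EXIT = 10.0, 5.0  # warning/critical hold zone: 5.0 to 10.0


def next_state(state, loss):
    """Return the new state ('fine', 'moderate' or 'bad') after one loss sample."""
    if state == "fine":
        if loss >= CRITICAL_ENTER:
            return "bad"
        return "moderate" if loss >= WARNING_ENTER else "fine"
    if state == "moderate":
        if loss >= CRITICAL_ENTER:
            return "bad"
        return "fine" if loss < WARNING_EXIT else "moderate"
    # state == "bad": leave only once below the critical hold zone
    if loss < WARNING_EXIT:
        return "fine"
    return "moderate" if loss < CRITICAL_EXIT else "bad"


def transitions(samples, state="fine"):
    """Return a list of (sample index, 'loss-<state>') for every state change."""
    events = []
    for index, loss in enumerate(samples):
        new_state = next_state(state, loss)
        if new_state != state:
            events.append((index, "loss-" + new_state))
            state = new_state
    return events


samples = [0.5, 1.5, 2.5, 1.5, 12.0, 7.0, 4.0, 1.5, 0.5]
events = transitions(samples)
```

With these sample values the sequence of events is moderate, bad, moderate, fine, the same order of states as in the example log, while samples that fall inside a hold zone (1.5 and 7.0) trigger no change.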
Current areas of work (2/2)
• What are the traffic patterns related to the Grid?
– Full perfSONAR monitoring of Tier 1/Tier 2 sites in Spain by RedIRIS
• LHCOPN
– SA2 is leading the design and implementation of a federated model
• Also tasks not fully ENOC-related: SLA, IPv6, etc.
Conclusion (1/2)
• Networks “seem” to be working really well
– Not so many complex multi-domain issues
– Current strategy is still: if it is down, just wait for it to come back
• A lot of work performed to set up the ENOC
– Simple ideas revealed technical challenges
• Unfortunately our requirements are constraints for network providers
– No clear benefits for them to follow us
 Slowing down happy « collaboration »
– Local user community versus worldwide project...
Conclusion (2/2)
• Mixed results around the ENOC
– Technical success but insufficiently used
 Now strong competition from local support structures
– Blockers: lack of some key requirements
• Near future
– European Grid Initiative (EGI) Network Support Centre (ENSC)
– No active role in network operations expected any longer
– Focused on underlying network tasks at project level
 Monitoring, advanced network services, quality assessment
• Such project-wide issues might become common
– Abstract all network providers at project level
Including a joint network session with TERENA NRENs & Grid workshop and EGEE SA2
Thank you!
Questions?
http://www.eu-egee.org/
Acknowledgements
• EGEE SA2 team
– Main partners
 CERTH
 CNRS
 DANTE
 DFN
 GARR
 GRNET
 NTUA
 RedIRIS
 RRC-KI
• IN2P3-CC network team
• CNRS UREC