WLCG: Experience from Real Data Taking at the LHC

WLCG: Experience from Real Data Taking at the LHC
[email protected]
WLCG Service Coordination
& EGI InSPIRE SA3 / WP6
TNC2010, Vilnius, Lithuania
More information: LHC Physics Centre at CERN
Overview
• Summarize the Worldwide LHC Computing Grid:
– In terms of scale & scope;
– In terms of service.
• Highlight the main obstacles faced in arriving at
today’s situation
• Issues, concerns & directions for the future
• As this presentation will be given remotely, I have
deliberately kept it as free as possible from any
graphics or animations
The Worldwide LHC Grid (WLCG)
• Simply put, this is the distributed processing and storage
system deployed to handle the data from the world’s largest
scientific machine – the Large Hadron Collider (LHC)
• Based today on grid technology – including the former EGEE
infrastructure in Europe plus the Open Science Grid in the US
• WLCG is more than simply a customer of EGEE: it has been
and continues to be a driving force not only in the grid
domain but also in others, such as storage and data
management
• WLCG has always been about a production service – one
that is needed 24 x 7 on most days (~362) of the year
– Much activity – particularly at the Tier0 and in Tier0-Tier1 transfers –
takes place at night and over weekends (accelerator cycle)
The WLCG Deployment Model
• WLCG is the convergence of grid technology with a specific
deployment model, elaborated in the late 1990s in the
“Models of Networked Analysis at Regional Centres” (MONARC) project
• This defined the well-known hierarchy Tier0/Tier1/Tier2 that is
now common to several disciplines and matches well to
International Centre / National Centres / Local Institutes
• MONARC originally foresaw limited networking between
Tier0/Tier1/Tier2s – with air freight as a possible backup to
(best case) 622Mbps links (cost!), as well as a smaller number
of centres than we have today
– We now have redundant 10 Gbps links: T0-T1 & also T1-T1, some of
which are on occasion maxed out!
• These base assumptions are currently being re-discussed
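To make the shift in assumptions concrete, here is a minimal back-of-the-envelope sketch comparing transfer times over the MONARC-era best-case 622 Mbps link and a present-day 10 Gbps link; the 1 TB dataset size is an arbitrary illustrative choice, and protocol overhead and link sharing are ignored.

```python
# Back-of-the-envelope comparison of the MONARC-era best-case link (622 Mbps)
# with a present-day 10 Gbps link. The 1 TB dataset is an arbitrary,
# illustrative size; overheads and link sharing are ignored.

DATASET_TB = 1.0
DATASET_BITS = DATASET_TB * 1e12 * 8          # terabytes -> bits

for name, rate_bps in [("622 Mbps (MONARC best case)", 622e6),
                       ("10 Gbps (today, per link)", 10e9)]:
    hours = DATASET_BITS / rate_bps / 3600
    print(f"{name}: ~{hours:.1f} h to move {DATASET_TB:g} TB")

# Roughly 3.6 h vs 0.2 h: a factor of ~16 in nominal per-link capacity,
# before even counting the redundancy of today's links.
```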
The WLCG Scale
• Deployed worldwide: Americas, Europe & Asia-Pacific
• Computational requirements: O(10^5) cores
• Networking requirements: routinely move 1PB of data per
day between grid sites – significant intra-site requirements
• A single VO transfers data CERN-Tier1s at >4 GB/s over sustained
periods (~days)
• Annual growth in stored data: 15PB
– Based on an old calculation: the # of copies & location(s) of data may well be
revised in the coming months, as may trigger rates & event sizes
• Sum of resources at each tier approximately equal
– 1 Tier0, ~10 Tier1s, ~100 Tier2s
• Sum of tickets at each tier (service metric) also ~equal!
– A few: rarely as many as 5 per VO per day (OPS meeting)
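As a quick sanity check on the figures above, the minimal sketch below converts the routine 1 PB/day inter-site volume and the >4 GB/s sustained single-VO rate into common units; it uses daily averages only and ignores burstiness and protocol overhead.

```python
# Convert the headline WLCG figures into common units (daily averages only;
# real traffic is bursty and protocol overheads are ignored).

SECONDS_PER_DAY = 86_400

pb_per_day = 1.0                                      # routine inter-site volume
avg_gbps = pb_per_day * 1e15 * 8 / SECONDS_PER_DAY / 1e9
print(f"1 PB/day         ≈ {avg_gbps:.0f} Gb/s averaged over a day")

sustained_gb_per_s = 4.0                              # single-VO CERN-Tier1s rate
tb_per_day = sustained_gb_per_s * SECONDS_PER_DAY / 1e3
print(f"4 GB/s sustained ≈ {tb_per_day:.0f} TB/day from one VO alone")
```

Even as a daily average, 1 PB/day corresponds to over 90 Gb/s of aggregate traffic, which helps explain why individual 10 Gbps links can on occasion be maxed out.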
The WLCG Service
• The WLCG Service is “held together” through (week)daily operations calls /
meetings – remote participation is a requirement in all meetings!
– Attended by representatives from all main LHC experiments, Tier1 sites and
(typically) OSG & GridPP & chaired by a WLCG Service Coordinator on Duty
• Follow-up on key service issues, site problems & accelerator / experiment
news
– Calls typically last 30’ with notes – based on pre-minutes – circulated shortly after
• Longer-term issues, including roll-out of new releases, handled at fortnightly
WLCG Service Coordination meetings
• In addition, regular “Collaboration” and topical workshops
– Collaboration workshops attended by 200 – 300 people, many of whom (e.g.
Tier2s) do not attend more frequent events
• Experiments also hold regular meetings with their sites, as well as
“Jamborees” – WLCG adds value in helping to identify / provide common
solutions and/or when dealing with sites supporting multiple VOs
– Typically the case for Tier1 sites outside of North America
WLCG: Service Reporting
• Regular service reports are made to the WLCG
Management Board based on Key
Performance Indicators together with reports
on any major service incident (“SIRs”)
• The KPIs: Site Usability based on VO tests;
GGUS summaries (user, team & alarm tickets);
# of SIRs & their nature
• This can be summarized in one (colour) slide:
drill-down provided when not A-OK (usually…)
WLCG Operations Report – Summary
KPI                       | Status                                                          | Comment
GGUS tickets              | 1 “real” alarm, 4 test alarms; normal # of team & user tickets | Drill-down on real alarms; comment on tests
Site Usability            | Minor issues                                                    | Drill-down (hidden)
SIRs & Change assessments | Four incident reports                                           |

GGUS ticket summary by VO:

VO     | User | Team | Alarm | Total
ALICE  |    6 |    1 |     1 |     8
ATLAS  |   21 |   67 |     1 |    89
CMS    |    2 |    3 |     1 |     6
LHCb   |    0 |   25 |     2 |    27
Totals |   29 |   96 |     5 |   130
Drill-down on each
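As an aside, here is a minimal sketch of how such a weekly per-VO ticket summary might be assembled from a flat export of GGUS tickets; the record format and the sample entries below are hypothetical and do not correspond to the real numbers above.

```python
from collections import Counter

# Hypothetical flat export of one week's GGUS tickets as (VO, type) pairs;
# both the format and the sample values are invented for illustration.
tickets = [
    ("ATLAS", "team"), ("ATLAS", "user"), ("ATLAS", "alarm"),
    ("LHCb", "team"), ("CMS", "user"), ("ALICE", "alarm"),
]

counts = Counter(tickets)                              # (VO, type) -> count
print(f"{'VO':<8}{'User':>6}{'Team':>6}{'Alarm':>7}{'Total':>7}")
for vo in sorted({vo for vo, _ in tickets}):
    row = [counts[(vo, t)] for t in ("user", "team", "alarm")]
    print(f"{vo:<8}{row[0]:>6}{row[1]:>6}{row[2]:>7}{sum(row):>7}")
```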
WLCG: Team & Alarm tickets
• These are two features – along with the weekly
ticket summaries – that were introduced for
WLCG into GGUS
• Team tickets – shared by “shifters” and passed
on from one shift to another
• Alarm tickets – when you need to get someone
out of bed: only named, authorized people may
use them; only for agreed “critical services”
• Targets for expert intervention & problem
resolution: statistics fortunately rather low
WLCG: Recent Problems
• Three real problems from the most recent MB
report: 2 network-related and 1 data / storage
management
1. Loss of DNS in Germany(!) – GGUS black-out
2. Network performance on CERN – Taiwan link:
delays in problem resolution (problem turned
out to be near Frankfurt) due to holiday w/e
3. Major data issue: a combination of events led
to data loss which IMHO cannot be tolerated
WLCG Networking Issues
• Network works well most of the time: adequate
bandwidth, backup links
• Occasional problems are often lengthy (days) to
resolve: the future direction is to be more network-centric
(breaking the MONARC model, a more flexible
“hierarchy”, remote data access)
• Network problems: lines cut by motorway
construction, fishing trawlers & tsunamis
⇒ Tighter integration of network operations with
WLCG operations is already needed
WLCG Service: Challenges
• The “Service Challenge” programme was initiated in 2004 to
deliver a full production service capable of handling data
taking needs well in advance of real data taking
• Whilst we have recently been congratulated for the status
and stability of the service, it has been much more difficult
and taken much longer than anyone anticipated
• We should have been where we are today 2-3 years ago!
• The service works with most problems being addressed
sufficiently quickly – but there are still many improvements
needed and the operations cost is high (sustainable?)
• These improvements include adapting to constantly evolving technology
(multi-core, virtualization, changes in the data & storage
management landscape, cloud computing, …)
• As well as the move from EGEE to EGI…
(Some) Data Related Issues
• Data Preservation
• Storage Management
• Data Management
• Data Access
WLCG & Data: Futures
• Just as our MONARC+grid choice was made some time
ago and is now being reconsidered, so too are our data
management strategies, which were chosen equally long ago
• Whilst for a long time HEP’s data volumes and rates were
special, they no longer are, nor are our requirements
along other axes, such as Data Preservation
• “Standard building blocks” now look close to production
readiness for at least some of our needs (tape access &
management layer, cluster/parallel filesystems, data
access…), though still requiring some “integration glue” which can
hopefully be shared with other disciplines
• “Watch this space” – data & storage management in HEP
looks about to change (but not too much: it takes a long
time to roll out new services and/or migrate multi-PB of
data…)
WLCG & EGEE
• The hardening of the WLCG service took place
throughout the entire lifetime of the EGEE project
series: 2004 – 2010
• The service continues to evolve: one recent change was
in the scheduling of site / service interventions with
respect to LHC machine stops: it was agreed that no more than 3
sites or 30% of the total Tier1 resources of a given VO
can be in (scheduled) downtime at the same time (see the sketch after this list)
• Another was the introduction of Change / Risk
assessments prior to service interventions: still not
sufficiently rigorous
• Constant service evolution is clearly measurable over a
time scale of months, e.g. via the Quarterly Reports
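As mentioned above, the agreed downtime-scheduling rule can be expressed as a simple check; the sketch below is illustrative only, with an invented helper function and invented per-site resource shares for one VO (the Tier1 names are real, the numbers are not).

```python
# Illustrative check of the agreed rule: for a given VO, no more than 3 Tier1
# sites, and no more than 30% of that VO's total Tier1 resources, may be in
# scheduled downtime at the same time. Shares below are invented.

def downtime_allowed(shares, down_sites, max_sites=3, max_fraction=0.30):
    """Return True if the proposed set of down sites respects both limits."""
    down_fraction = sum(shares[s] for s in down_sites) / sum(shares.values())
    return len(down_sites) <= max_sites and down_fraction <= max_fraction

tier1_shares = {"CNAF": 0.15, "IN2P3": 0.20, "KIT": 0.25,
                "RAL": 0.20, "TRIUMF": 0.10, "ASGC": 0.10}

print(downtime_allowed(tier1_shares, {"CNAF", "TRIUMF"}))   # True: 2 sites, 25%
print(downtime_allowed(tier1_shares, {"KIT", "IN2P3"}))     # False: 45% of resources
```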
EGI InSPIRE SA3 / WP6
• Another significant change or transition that we are
now facing is the move from EGEE to EGI
• WP6 “Services for Heavy Users” is the only WP in
InSPIRE led by CERN and addresses a variety of
disciplines: HEP, Life Sciences (LS), Earth Sciences (ES), Astronomy &
Astrophysics (A&A), Fusion (F) and Computational Chemistry &
Materials Science and Technologies (CCMST)
• Intent: continue to seek commonalities and
synergies, not only within but also across disciplines
• Also “bind” related projects in which CERN is
involved in LS & ES areas, plus links with A&A + F
SA3 – Long Term Goals
• One of the key issues that needs to be addressed by this
Work Package is a model for long-term sustainability
• This could have multiple elements:
– Some services moving to standard infrastructure;
– Some of the current effort coming from the VOs themselves:
this has been the standard model in HEP for many years!
– A nucleus of expertise to assist with migrations to new
service versions and/or larger changes should nevertheless
be foreseen: for HEP this level of effort needs to be higher than it is today
• Expect to elaborate on these issues through regular
workshops and conferences, including EGI Technical and
User Fora plus other multi-disciplinary events
– The IEEE NSS & MIC conference brings together a number of those involved in SA3
WLCG, the LHC & Experiments
• Metric for WLCG’s success:
– WLCG seen and acknowledged as delivering a service
that is an essential part of the LHC physics programme
• Metric for EGI InSPIRE SA3’s success:
– SA3 seen and acknowledged both within and across
the disciplines that it supports but also beyond: the
longer term goals of RoI to science and society
 Investment in Research & Education is essential:
the motivation clearly goes beyond direct returns
Summary
• WLCG delivers a production service at the tera- &
peta-scale; it permits local investment and
exploitation, solving the “brain drain” and related
problems of previous solutions
• It has taken longer & been a lot harder than foreseen
• Many of the basic ingredients work well and are
applicable beyond grids; complexity has not always
been justified
• Expect some changes, e.g. in Data Management + a
more network-centric deployment model, in coming
years: these changes need to be adiabatic
LHCC Referees…
• “First experience of the World-wide LHC
Computing Grid (WLCG) with the LHC has been
positive. This is very much due to the
substantial effort invested over several years
during the intensive testing phase and all Tier
centres must take credit for this success. The
LHCC congratulates the WLCG on the
achievements.”
May 2010
BACKUP
Intervention & Resolution Targets
• Targets (not commitments) proposed for Tier0 services
– Similar targets requested for Tier1s/Tier2s
– Experience from first week of CCRC’08 suggests targets for problem
resolution should not be too high (if ~achievable)
• The MoU lists targets for responding to problems (12 hours for T1s)
• Tier1s: 95% of problems resolved < 1 working day?
• Tier2s: 90% of problems resolved < 1 working day?
⇒ Post-mortem triggered when targets are not met!
Time Interval | Issue (Tier0 Services)                       | Target
End 2008      | Consistent use of all WLCG Service Standards | 100%
30'           | Operator response to alarm / phone call      | 99%
1 hour        | Operator response to alarm / phone call      | 100%
4 hours       | Expert intervention in response to above     | 95%
8 hours       | Problem resolved                             | 90%
24 hours      | Problem resolved                             | 99%
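For illustration, here is a minimal sketch of how the “problem resolved” lines of the table above could be checked against measured resolution times; the sample times are invented and would in practice come from ticket timestamps.

```python
# Check the "problem resolved" lines of the Tier0 target table against a set
# of measured resolution times (hours). The sample times are invented.

targets = [(8, 0.90), (24, 0.99)]         # (hours, required fraction resolved)
resolution_hours = [0.8, 2.5, 6.0, 7.5, 30.0, 3.0, 1.2, 5.5, 7.9, 4.0]

for limit, required in targets:
    frac = sum(t <= limit for t in resolution_hours) / len(resolution_hours)
    verdict = "met" if frac >= required else "NOT met -> post-mortem"
    print(f"resolved within {limit:>2} h: {frac:.0%} (target {required:.0%}) -> {verdict}")
```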