Tier-1 Overview
Andrew Sansum
21 November 2007
Overview of Presentations
• Morning Presentations
  – Overview (Me)
    • Not really an overview – at Tony’s request, mainly MoU commitments
  – CASTOR (Bonny)
    • Storing the data and getting it to tape
  – Grid Infrastructure (Derek Ross)
    • Grid Services
    • dCache future
    • Grid-only access
  – Fabric Talk (Martin Bly)
    • Procurements
    • Hardware infrastructure (inc. local network)
    • Operation
• Afternoon Presentations
  – Neil (RAL benefits)
  – Site Networking (Robin Tasker)
  – Machine Rooms (Graham Robinson)
What I’ll Cover
• Mainly going to cover MoU commitments
  – Response Times
  – Reliability
  – On-Call
  – Disaster planning
• Also cover staffing
GRIDPP2 Team Organisation
• Grid Services: Ross, Condurache, Hodges, Klein (PPS), Vacancy
• Fabric (H/W and OS): Bly, Wheeler, Vacancy, Thorne, White (OS support), Adams (HW support)
• CASTOR SW/Robot and Grid/exp Support: Corney (GL), Strong (Service Manager), Folkes (HW Manager), deWitt, Jensen, Kruk, Ketley, Jackson (CASE), Prosser (Contractor) (nominally 5.5 FTE)
• Machine Room operations (1.5 FTE)
• Networking Support (0.5 FTE)
• Database Support (0.5 FTE) (Brown)
• Project Management: Sansum/Gordon/(Kelsey) (1.5 FTE)
Staff Evolution to GRIDPP3
• Level
– GRIDPP2 (13.5 GRIDPP + 3.0 e-Science)
– GRIDPP3 (17.0 GRIDPP + 3.4 e-Science)
• Main changes
– Hardware repair effort 1->2 FTE
– New incident response team (2 FTE)
– Extra CASTOR effort (0.5 FTE) (but this is effort that has already been working on CASTOR, unreported)
– Small changes elsewhere
• Main problem
– We have temporarily injected 2 FTE of effort into CASTOR. The long-term GRIDPP3 plan funds less effort than current experience suggests we need.
WLCG/GRIDPP MoU Expectations
• Maximum delay in responding to operational problems (service interruption / degradation of capacity by more than 50% / degradation by more than 20%) and average availability measured on an annual basis (during accelerator operation / at all other times):
  – Acceptance of data from the Tier-0: 12 hours / 12 hours / 24 hours; availability 99% / n/a
  – Networking service to the Tier-0 during accelerator operation: 12 hours / 24 hours / 48 hours; availability 98% / n/a
  – Data-intensive analysis services, including networking to Tier-0, Tier-1 centres: 24 hours / 48 hours / 48 hours; availability 98% / 98%
  – All other services – prime service hours[1]: 2 hours / 2 hours / 4 hours; availability 98% / 98%
  – All other services – other times: 24 hours / 48 hours / 48 hours; availability 97% / 97%
[1] Prime service hours are 08:00-18:00 during the working week of the centre, except public holidays.
Response Time
• Time to acknowledge a fault ticket
• 12-48 hour response time outside the prime shift
• The on-call system should easily cover this, provided problem tickets can be classified automatically by the level of service required (a minimal classification sketch follows below)
• Cover during the prime shift is more challenging (2-4 hours), but is already a routine task for the Admin on Duty
• To hit the availability target, response must be much faster (2 hours or less)
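As a rough illustration of the sort of automatic classification this relies on, the sketch below maps a ticket's service class and severity onto the MoU acknowledgement deadlines from the table above. The service names, severity labels and prime-shift rule are assumptions made for the sketch, not the actual RAL ticketing schema.

```python
# A minimal sketch, assuming a hypothetical ticket schema: map a fault ticket
# onto the maximum MoU acknowledgement delay from the table above. The service
# names, severity labels and prime-shift rule here are illustrative, not the
# actual RAL ticketing system.
from datetime import datetime, time

# (service class, severity) -> maximum acknowledgement delay in hours.
# Severities follow the MoU columns: full interruption, >50% and >20% degradation.
MOU_DELAY_HOURS = {
    "tier0_data_acceptance": {"interruption": 12, "degraded_50": 12, "degraded_20": 24},
    "tier0_networking":      {"interruption": 12, "degraded_50": 24, "degraded_20": 48},
    "analysis_services":     {"interruption": 24, "degraded_50": 48, "degraded_20": 48},
    "other_prime_hours":     {"interruption": 2,  "degraded_50": 2,  "degraded_20": 4},
    "other_out_of_hours":    {"interruption": 24, "degraded_50": 48, "degraded_20": 48},
}

def is_prime_shift(when: datetime) -> bool:
    """Prime service hours: 08:00-18:00 on working days (public holidays ignored here)."""
    return when.weekday() < 5 and time(8, 0) <= when.time() < time(18, 0)

def max_response_hours(service: str, severity: str, opened: datetime) -> int:
    """Maximum delay (hours) before the ticket must be acknowledged under the MoU."""
    if service == "other":
        service = "other_prime_hours" if is_prime_shift(opened) else "other_out_of_hours"
    return MOU_DELAY_HOURS[service][severity]

# Example: a batch-farm interruption reported late on a Wednesday evening.
print(max_response_hours("other", "interruption", datetime(2007, 11, 21, 22, 30)))  # -> 24
```

With something of this shape driving the callout system, out-of-hours tickets with a 12-48 hour window need never page anyone immediately, while prime-shift interruptions would go straight to the Admin on Duty.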
Reliability
• Have made good progress in the last 12 months
  – Prioritised issues affecting SAM test failures
  – Introduced “issue tracking” and weekly reviews of outstanding issues
  – Introduced resilience into trouble spots (but more still to do)
  – Moved services onto appropriately sized hardware, separated services, etc.
  – Introduced a new team role, “Admin on Duty”: monitors farm operation, ticket progression and EGEE broadcast information
• Best Tier-1 averaged over the last 3 months (other than CERN)
RAL-LCG2 Availability
MoU Commitments (Availability)
• Really reliability (availability while scheduled up)
• Still tough – 97-99% service availability will be hard (1% is just 87 hours per year; the downtime budgets are worked through below)
  – OPN reliability is predicted to be 98% without resilience; the site SJ5 connection is much better (Robin will discuss)
  – Most faults (75%) will fall outside normal working hours
  – Software components are still changing (e.g. CASTOR upgrades, WMS)
  – Many faults in 2008 will be “new”, only emerging as WLCG ramps up to full load
  – Emergent faults can take a long time to diagnose and fix (days)
• To improve on current availability we will need to:
  – Improve automation
  – Speed up the manual recovery process
  – Improve monitoring further
  – Provide on-call cover
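To make the downtime budget concrete, here is a small illustrative calculation of the annual downtime allowed by each MoU target, together with the availability/reliability distinction drawn above. It is plain arithmetic on the targets, not measured values, and the helper functions are a sketch rather than any particular monitoring tool's definition.

```python
# Illustrative arithmetic only: the annual downtime budget implied by each MoU
# availability target, plus the availability vs. reliability distinction
# (reliability excludes scheduled downtime).
HOURS_PER_YEAR = 365 * 24  # 8760

for target in (0.97, 0.98, 0.99):
    budget = (1 - target) * HOURS_PER_YEAR
    print(f"{target:.0%} availability allows {budget:.1f} hours of downtime per year")
# 97% -> 262.8 h, 98% -> 175.2 h, 99% -> 87.6 h

def availability(up_hours: float, scheduled_down: float, unscheduled_down: float) -> float:
    """All downtime, scheduled or not, counts against the target."""
    return up_hours / (up_hours + scheduled_down + unscheduled_down)

def reliability(up_hours: float, scheduled_down: float, unscheduled_down: float) -> float:
    """'Availability while scheduled up': scheduled outages are excluded."""
    return up_hours / (up_hours + unscheduled_down)
```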
On-Call
• On-Call will be essential in order to meet response
and availability targets.
• On-Call project now running (Matt Hodges); the target is to have on-call operational by March 2008.
• Automation, recovery and monitoring are all important parts of the on-call system: avoid callouts by avoiding problems.
• May be possible to have some weekend on-call cover
before March for some components.
• On-call will continue to evolve after March as we
learn from experience.
Disaster Planning (I)
• The extreme end of the availability problem. A risk analysis exists, but it is ageing and not fully developed.
• Highest-impact risks:
  – Extended environment problem in the machine room
    • Fire
    • Flood
    • Power failure
    • Cooling failure
  – Extended network failure
  – Major data loss through loss of CASTOR metadata
  – Major security incident (site or Tier-1)
Disaster Planning (II)
• Some disaster plan components exist
  – Disaster plan for the machine room: assuming equipment is undamaged, relocate and endeavour to sustain functions, but at much reduced capacity
  – Datastore (ADS) disaster recovery plan developed and tested
  – Network plan exists
  – Individual Tier-1 systems have documented recovery processes and fire-safe backups, or can be re-instanced from the kickstart server; not all of these are simple, nor are all fully tested
• Key missing components
  – National/global services (RGMA/FTS/BDII/LFC/…): address by distributing elsewhere; probably feasible, and necessary – 6 months
  – CASTOR: all our data holdings depend on the integrity of the catalogue, and recovery from first principles has not been tested; flagged as a priority area, but must be balanced against the need to make CASTOR work
  – A second, independent Tier-1 build infrastructure that would allow us to rebuild the Tier-1 at a new physical location and so address major issues such as fire; a major project – priority?
Conclusions
• Made a lot of progress in many areas this year: availability is improving, hardware is reliable, CASTOR is working quite well and upgrades are on track.
• Main challenges for 2008 (data taking):
  – Large hardware installations and an almost immediate next procurement
  – CASTOR at full load
  – On-call and general MoU processes