Transcript GridPP - Tier1 View
Tier1 View: Resilience
Status, plans, and best practice
Martin Bly, RAL Tier1 Fabric Manager. GridPP22, UCL, 2 April 2009
Overview
“How to make critical services at the T1 bullet proof”
2 April 2009 Resilience at the Tier1 - Martin Bly - GridPp22 2
Resilience - Why?
• Services and system components fail –
• You don’t want your services to be brought down by a failure
  – MoU commitments quite taxing to meet even without failures
  – You can’t hide from auntie SAM…
• Better to deal with problems without pressure to restart services
  – Fewer mistakes
• Even better to avoid the problems in the first place
• So: design service implementation so that it *will* survive failures of whatever nature
Approaches to resilience
• Hardware
  – Use hardware that can survive component failure
• Software
  – Use software that can survive problems on hardware
  – Use software designed for distributed operation
  – Use software that has inbuilt resilience
• Location
  – Locate hosts such that a service can survive failure at host location
Hardware
Resilient hardware will help your services survive common failure modes and keep them operating until you can replace the failed component and make the service resilient again.
Storage
• Most common is RAID as used in storage arrays
• Single (RAID5) or double (RAID6) disk failures do not take out the storage array
  – Use of hot spares allows automatic rebuilds to maintain the resilience
• RAID1 for system disks in servers
  – In the event of a single disk failure the server carries on
  – RAID1 with a hot spare can be used for super-critical systems: automatic rebuild maintains the resilience
• Works with software RAID as well as hardware RAID controllers
  – If you set the BIOS up for hot-swap capability…
• Failed disks can be replaced without taking the service down
  – If you have hot-swap caddies
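The trade-off between the RAID levels named above can be sketched as a small calculation. This is a minimal illustration rather than a sizing tool: the function name `raid_summary` and the example figures (16 disks of 1 TB, one hot spare) are assumptions, and a hot spare is modelled simply as a disk that holds no data until a rebuild starts.

```python
def raid_summary(level, disks, disk_tb, hot_spares=0):
    """Return (usable_tb, tolerated_disk_failures) for a simple array.

    level: 1, 5 or 6. Hot spares are set aside and hold no data.
    """
    active = disks - hot_spares
    if level == 1:
        if active < 2:
            raise ValueError("RAID1 needs at least 2 active disks")
        # Every active disk holds a full copy of the data.
        return disk_tb, active - 1
    parity = {5: 1, 6: 2}.get(level)
    if parity is None:
        raise ValueError("only RAID 1, 5 and 6 are modelled here")
    if active < parity + 2:
        raise ValueError("not enough active disks for this RAID level")
    # Capacity of `parity` disks is spent on redundancy.
    return (active - parity) * disk_tb, parity


# Hypothetical example: 16 x 1 TB disks, RAID6, one hot spare.
usable, tolerated = raid_summary(6, 16, 1.0, hot_spares=1)
```

The hot spare does not increase the number of simultaneous failures the array tolerates; it shortens the time the array spends without redundancy by letting the rebuild start automatically.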
Memory
• ECC helps systems to detect and correct single-bit errors, and to detect (and on some platforms correct) multi-bit errors, in the RAM
  – Can help prevent data corruption
• If the ECC correction rate begins to rise, the RAM may be failing, or need reseating, or be subject to interference, or be slipping out of tolerance.
• Higher-end kit can stop using ‘bad’ RAM
  – If not interrupting the service is considered worth the (high) cost
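Watching for a rising ECC correction rate can be sketched as follows. This is a hedged illustration only: the function names, the `factor` and `floor` thresholds, and the sample format (time in seconds, cumulative corrected-error count) are all assumptions, and how the counter is read from the hardware is platform-specific and left out.

```python
def correction_rates(samples):
    """Turn (time_s, cumulative_corrected_errors) samples into errors/hour."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("samples must be in increasing time order")
        # A counter that goes backwards usually means a reboot reset it.
        delta = c1 - c0 if c1 >= c0 else c1
        rates.append(delta * 3600.0 / (t1 - t0))
    return rates


def ecc_rate_rising(samples, factor=3.0, floor=1.0):
    """True if the latest rate exceeds `factor` times the earlier average.

    `floor` (errors/hour) stops a jump from zero to a tiny rate
    from raising an alarm.
    """
    rates = correction_rates(samples)
    if len(rates) < 2:
        return False
    earlier = rates[:-1]
    baseline = sum(earlier) / len(earlier)
    latest = rates[-1]
    return latest > floor and latest > factor * baseline
```

A check like this would normally feed the monitoring system rather than act on its own, since a rising rate may equally mean the module needs reseating.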
Power Supply
• Redundant PSU configurations
  – N+1 redundancy: at least one more PSU in a server than is needed to make it work. If one fails, the server keeps running and the failed unit can be replaced without taking the server down
• Multiple power feeds
  – For an N+1 redundant PSU configuration, one can feed each PSU from a different PDU. If one PDU fails (and they do), or the fuse blows (and they definitely do!), the other PSU is still powered and the service can continue
• UPS for systems where loss of power is a problem
  – Bridge blips, brownouts and short interruptions, smoothed feed, harmonic reduction
  – Permanent or time-limited: how much power must it provide and how long must it continue?
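Whether a given cabling plan actually delivers the benefit of multiple power feeds can be checked mechanically. The sketch below is an assumption-laden illustration: the function name, and the PSU and PDU labels in the example, are hypothetical.

```python
def survives_any_single_pdu_failure(psu_feeds, psus_needed):
    """Check an N+1 server against the loss of any one PDU.

    psu_feeds maps PSU name -> PDU name; psus_needed is the N in N+1.
    """
    if len(psu_feeds) < psus_needed:
        return False
    for failed in set(psu_feeds.values()):
        # Count the PSUs that are fed from a PDU other than the failed one.
        still_powered = sum(1 for pdu in psu_feeds.values() if pdu != failed)
        if still_powered < psus_needed:
            return False
    return True


# Hypothetical server with two PSUs, one needed to run it.
good = survives_any_single_pdu_failure({"psu0": "pduA", "psu1": "pduB"}, 1)
bad = survives_any_single_pdu_failure({"psu0": "pduA", "psu1": "pduA"}, 1)
```

The second layout has redundant PSUs but both hang off the same PDU, so a single blown fuse still takes the server down.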
Interconnects
• Networking
  – Two or more bonded network ports can provide resilience if the cables are routed to different switches or via different routes; increases performance too
  – Bonded links in fibre installations can provide resilience against transceiver failure or fibre cuts
  – ‘Stacked’ switches with bi-directional stacking capability
    • If one cable fails, data goes the other way
    • If one unit fails, data can still reach the unit on the other side
  – Fail-over links in site infrastructure and national / international long-haul links: fibre cuts happen with depressing regularity
• Fibre-channel
  – Multi-port FC HBAs and array controllers can be set up to provide two independent routes from servers to storage devices, with multi-path and failover support keeping the data flowing
Software
Software services should be designed to be resilient and to be provided by multiple hosts and at distributed locations.
“This is the Grid – it’s distributed. If the services aren’t distributable,
Monitoring
• If it can be monitored…
• Look for and restart failed service daemons
• Look for signatures of impending problems to predict component failure
• Idle disks hide their faults
  – Regular low-level verification runs to push sick drives over the edge
  – Replace early in failure cycle
    • So it doesn’t fail during a rebuild…
• Increased error rates on network links from failing line cards, transceivers or cable/fibre degradation
  – If you have redundant links, you can replace the faulty one and keep the service going
• Call-out system for problems that impact services
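The "restart failed daemons, call out if that does not help" pattern can be sketched as one monitoring pass. This is an illustration under stated assumptions, not the Tier1's tooling: `check_services`, the `max_restarts` limit, and the three callables (`is_running`, `restart`, `call_out`) are hypothetical and would wrap whatever process checks and paging system a site actually uses.

```python
def check_services(services, is_running, restart, call_out,
                   max_restarts=3, counts=None):
    """Run one monitoring pass over `services`.

    counts tracks consecutive failed checks per service between passes.
    A failed service is restarted up to `max_restarts` times in a row;
    after that the problem is escalated through `call_out` instead.
    """
    counts = {} if counts is None else counts
    for name in services:
        if is_running(name):
            counts[name] = 0  # healthy again, forget earlier failures
            continue
        counts[name] = counts.get(name, 0) + 1
        if counts[name] > max_restarts:
            call_out(name)
        else:
            restart(name)
    return counts
```

Capping the number of automatic restarts keeps a daemon that fails immediately on every start from being restarted forever without anyone being told.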
Multiple hosts
• Services can be provided by more than one host if the application supports it
  – Share the load and increase performance
  – If one host fails, the rest provide the service
  – Use DNS round-robin to ‘randomly’ select a host using a service alias with short TTL
  – Take broken host/s out of active DNS
  – Avoid single-points-of-failure
• Can locate multiple hosts…
  – … in different rooms
  – … in different buildings
  – … at different sites
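The effect of a round-robin alias with broken hosts taken out can be modelled in a few lines. This is a toy model, not a DNS implementation: the class name `RoundRobinAlias` and the host names are hypothetical, and random selection stands in for the ‘random’ ordering a client sees from a round-robin alias.

```python
import random


class RoundRobinAlias:
    """Toy model of a service alias backed by several hosts."""

    def __init__(self, hosts, rng=None):
        self.active = list(hosts)
        self.rng = rng or random.Random()

    def remove(self, host):
        """Take a broken host out of the active set."""
        if host in self.active:
            self.active.remove(host)

    def resolve(self):
        """Pick one of the hosts currently behind the alias."""
        if not self.active:
            raise RuntimeError("no active hosts behind the alias")
        return self.rng.choice(self.active)


# Hypothetical triplet of hosts behind one alias.
alias = RoundRobinAlias(["host01.example.org", "host02.example.org",
                         "host03.example.org"])
alias.remove("host02.example.org")
chosen = alias.resolve()
```

In real DNS the removal only takes effect once cached answers expire, which is why the alias needs a short TTL.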
Tier1
Resilience steps at the Tier1…
Hardware at the Tier1
• Most of the hardware techniques are used at the Tier1
• Bulk storage uses RAID1/5/6, ECC RAM, N+1 PSUs, multiple power feeds, regular verifies of arrays (scrubbing)
• Services nodes use RAID1, ECC RAM, some with N+1 PSUs
• Databases: RAID1/10/5/6, ECC RAM, N+1 PSU, dual FC links, multiple power feeds
• Networking: redundant off-site link to SJ5
  – working on redundancy (failover/backup) for OPN link to CERN
• UPS (in the new building)
  – 24/7 UPS for critical services / database racks
  – Short-lived UPS for storage systems to allow clean shutdown
CASTOR Service
[Diagram. Recoverable labels: single CASTOR instance (e.g. CMS); srm; stager; LSF master; rmmaster; ns; LSF licence; shared CASTOR core (Neptune); Oracle RAC (Pluto); FC array. Notes on the diagram: in general (all for CMS), mirror disks on stager/LSF master and rmmaster; ns mirror disks.]
3D Services + LHCB LFC
[Diagram. Recoverable labels: 3D; LHCB LFC (readonly replica, single host, fast kickstart failover to CERN); 3D Oracle RAC; FC array.]
FTS and General LFC
[Diagram. Recoverable labels:]
• 5 web front ends in DNS RR; hot spare soon
• FTS: 1 channel / VO agent host (RAID 1)
• LFC: DNS RR. LFC currently single host; second host planned for mid September (work in progress, running late)
• Oracle RAC on RAID 10 SAN. Oracle currently 2 independent servers; work active to deploy 3-server RAC
CE and Fabric
[Diagram. Recoverable labels:]
• CE: ce02, 03, 04, 05. 3 doublets, one for each of ATLAS, CMS and LHCB; each CE has mirror disks
• dn to account mapping: mirrored disks
• torque/maui
• /home file system (hardware RAID)
• NIS
CE/SRM instances
WMS and LB
WMS triplet, LB doublet
• Now:
  – lcgwms01: LHC
  – lcgwms02: everyone
  – lcgwms03: non-LHC
• Developments:
  – lcgwms01: LHC
  – lcgwms02: LHC
  – lcgwms03: non-LHC
• All WMS use both LB systems
Other Tier1 Services
• UK-BDII:
  – DNS R-R triplet of simple hosts
  – Copes with load, provides resilience
  – Easy kickstart for rapid instancing
• RGMA registry:
  – single host, RAID disks, easy kickstart
• MONbox:
  – single host, RAID disks, easy kickstart
• VO boxes:
  – several x single host, easy kickstart
• Site BDII:
  – DNS R-R doublet of simple hosts (same as UK-BDII)
• PROXY:
  – Doublet of simple hosts, easy kickstart
• GOCDB:
  – internal failover with alternative database (Oracle), and external failover to another web front-end in Germany and mirrored database in Italy. Latter still being tested.
• Apel:
  – has a warm standby and is buying new hardware.
Tier1 Monitoring
• Catch problems early with nagios where possible (or at least catch problems before anyone notices)
  – load alarms
  – File systems near to full
  – certificates close to expiry
  – Failed drives
• Some ganglia/cacti capacity planning reviews (but ad hoc) looking for long term trends. Service Operations team making a difference.
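A "file system near to full" check in the style of a Nagios plugin can be sketched as below. It relies on the standard plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN) with a one-line status message; the function names, the message wording and the 85% / 95% thresholds are assumptions for illustration, not the Tier1's actual configuration.

```python
import shutil

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_usage(used_percent, warn=85.0, crit=95.0):
    """Map a usage percentage onto a Nagios state."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_filesystem(path, warn=85.0, crit=95.0):
    """Return (state, message) for the file system holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path}: reported size is zero"
    used_percent = 100.0 * usage.used / usage.total
    state = classify_usage(used_percent, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    return state, f"DISK {label} - {path} is {used_percent:.1f}% full"
```

Used as a plugin, a small wrapper would print the message and exit with the state, for example `state, message = check_filesystem("/")` followed by `print(message)` and `sys.exit(state)`.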
Tier1 Backups
• Critical hosts all backed up to tape store
• Tape details written to central loggers
  – So we can find which tape numbers to restore if the host is toast
• Speedy restores to toasted systems
• Verify and exercise backups…
Tier1 On-call
• A good driver for service improvement.
• Continuous improvement process with weekly review of night-time incidents
• Review is driver for:
  – Auto-restarters (team still not 100% keen)
  – Improved monitoring (more plugins)
  – Better response documentation
  – Changes to processes
• Also runs daytime
• Gradually routine operations will become more and more the responsibility of the service intervention team.
• CASTOR team carry out “weekly” detailed review of all incidents (looking to see how to avoid them again). Will generalise to whole Tier-1
Tier1 People
• Several teams with some degree of expertise sharing within each team
  – Fabric, Grid/Support, CASTOR, Databases
  – This has been pretty successful and we are reasonably confident we can handle tractable problems without the specialist present
• As far as is reasonably fair and practicable, we schedule leave so that expert cover is maintained (not always possible)
• On-call also spreading expertise in critical services (e.g., even the Facility Manager knows how to restart the CASTOR request handler!)
• Able to call upon RAL Tier-2 staff (or other GRIDPP/elsewhere) in case of complete lack of expertise. Have done this occasionally. Should probably be prepared to do it more often.
Off Site services
• A few critical services are candidates for off-site replication; others, such as BDIIs and LHCB LFC, are already federated
• Possible candidates: FTS and general LFC (possibly RGMA)
  – Both essential to GRIDPP
  – LFC based on Oracle
    • Streaming technology already deployed and tested elsewhere (3D)
  – RAL could operate these remotely, but existing configuration very expensive (£40K hardware) plus Oracle licences. Failover to new DNS names would also need to be site resilient (not trivial). May be worth exploring with nearby sites or Daresbury
Questions
To Andrew, please…!