GridPP - Tier1 View


Tier1 View: Resilience

Status, plans, and best practice

Martin Bly, RAL Tier1 Fabric Manager – GridPP22, UCL, 2 April 2009

Overview

“How to make critical services at the T1 bullet proof”


Resilience - Why?

• Services and system components fail – happens!

• You don’t want your services to be brought down by a failure
  – MoU commitments are quite taxing to meet even without failures
  – You can’t hide from auntie SAM…
• Better to deal with problems without pressure to restart services
  – Fewer mistakes
• Even better to avoid the problems in the first place
• So: design the service implementation so that it *will* survive failures of whatever nature

Approaches to resilience

• Hardware
  – Use hardware that can survive component failure
• Software
  – Use software that can survive problems on hardware
  – Use software designed for distributed operation
  – Use software that has inbuilt resilience
• Location
  – Locate hosts such that a service can survive failure at a host location

Hardware

Resilient hardware helps your services survive common failure modes and keeps them operating until you can replace the failed component and make the service resilient again.

Storage

• Most common is RAID as used in storage arrays
• Single (RAID5) or double (RAID6) disk failures do not take out the storage array
  – Use of hot spares allows automatic rebuilds to maintain the resilience
• RAID1 for system disks in servers – in the event of a single disk failure the server carries on
  – RAID1 with a hot spare can be used for super-critical systems – automatic rebuild maintains the resilience
• Works with software RAID as well as hardware RAID controllers (see the sketch below)
  – If you set the BIOS up for hot-swap capability…
• Failed disks can be replaced without taking the service down
  – If you have hot-swap caddies
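RAID only protects the service while the array is actually healthy, so degraded arrays need to be spotted and rebuilt promptly. Below is a minimal sketch of such a check for Linux software RAID, assuming the standard md driver and its /proc/mdstat interface; the alerting is just a print and exit-code placeholder.

```python
#!/usr/bin/env python
"""Minimal sketch: warn when a Linux software-RAID (md) array is degraded.

Assumes the kernel md driver and its /proc/mdstat text interface; the
alerting hook (print + exit code) stands in for a real call-out system.
"""
import re
import sys

def degraded_arrays(mdstat_path="/proc/mdstat"):
    """Return md device names whose member status shows a missing/failed disk.

    /proc/mdstat reports member health as e.g. [UU] (healthy) or [U_]
    (one member missing) on the line following each array entry.
    """
    degraded = []
    current = None
    with open(mdstat_path) as fh:
        for line in fh:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
                continue
            status = re.search(r"\[([U_]+)\]", line)
            if status and current and "_" in status.group(1):
                degraded.append(current)
                current = None
    return degraded

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("WARNING: degraded RAID arrays: %s" % ", ".join(bad))
        sys.exit(1)
    print("OK: all md arrays healthy")
```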

Memory

• ECC helps systems to detect and correct single-bit and multi-bit errors in the RAM – can help prevent data corruption
• If the ECC correction rate begins to rise, the RAM may be failing, or need reseating, or be subject to interference, or be slipping out of tolerance (see the sketch below)
• Higher-end kit can stop using ‘bad’ RAM – if not interrupting the service is considered worth the (high) cost
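One way to watch that correction rate is to poll the kernel's ECC error counters. A minimal sketch, assuming a Linux host that exposes the EDAC counters under /sys/devices/system/edac; the warning threshold is an arbitrary illustration, not a recommendation.

```python
#!/usr/bin/env python
"""Sketch: report ECC error counts from the Linux EDAC sysfs interface.

Assumes /sys/devices/system/edac/mc/mc*/ exposes ce_count (corrected) and
ue_count (uncorrected) counters; the warning threshold is illustrative only.
"""
import glob
import os

CE_WARN_THRESHOLD = 10  # arbitrary example threshold for corrected errors

def read_count(path):
    """Read a single integer counter file, treating missing files as zero."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (IOError, ValueError):
        return 0

for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
    ce = read_count(os.path.join(mc, "ce_count"))
    ue = read_count(os.path.join(mc, "ue_count"))
    name = os.path.basename(mc)
    if ue:
        print("CRITICAL: %s reports %d uncorrected errors" % (name, ue))
    elif ce > CE_WARN_THRESHOLD:
        print("WARNING: %s corrected-error count rising (%d) - "
              "consider reseating or replacing the DIMM" % (name, ce))
    else:
        print("OK: %s ce=%d ue=%d" % (name, ce, ue))
```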

Power Supply

• Redundant PSU configurations
  – N+1 redundancy: at least one more PSU in a server than is needed to make it work. If one fails, the server keeps running and the failed unit can be replaced without taking the server down
• Multiple power feeds
  – For an N+1 redundant PSU configuration, one can feed each PSU from a different PDU. If one PDU fails (and they do), or the fuse blows (and they definitely do!), the other PSU is still powered and the service can continue
• UPS for systems where loss of power is a problem
  – Bridge blips, brownouts and short interruptions; smoothed feed, harmonic reduction
  – Permanent or time-limited – how much power must it provide and how long must it continue? (see the sketch below)
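The sizing question at the end of that slide reduces to simple arithmetic: usable battery energy divided by the load, derated for inverter losses. A back-of-the-envelope sketch with made-up numbers:

```python
def ups_runtime_minutes(battery_wh, load_w, inverter_efficiency=0.9):
    """Rough UPS runtime estimate: usable battery energy / load.

    battery_wh          - nominal battery capacity in watt-hours
    load_w              - steady-state load in watts
    inverter_efficiency - fraction of battery energy delivered to the load
    """
    usable_wh = battery_wh * inverter_efficiency
    return 60.0 * usable_wh / load_w

# Illustrative numbers only: a 2000 Wh battery feeding a 1.5 kW rack gives
# roughly 72 minutes - enough for a clean shutdown, nowhere near 24/7 cover.
print("%.0f minutes" % ups_runtime_minutes(2000, 1500))
```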


Interconnects

• Networking
  – Two or more network ports bonded can provide resilience if cables are routed to different switches or via different routes – increases performance too (see the sketch below)
  – Bonded links in fibre installations can provide resilience against transceiver failure or fibre cuts
  – ‘Stacked’ switches with bi-directional stacking capability
    • If one cable fails, data goes the other way
    • If one unit fails, data can still reach the units on the other side
  – Fail-over links in site infrastructure and national/international long-haul links – fibre cuts happen with depressing regularity
• Fibre Channel
  – Multi-port FC HBAs and array controllers can be set up to provide two independent routes from servers to storage devices, with multi-path and failover support keeping the data flowing
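A bonded link only provides resilience while its remaining slaves are actually up, so the bond status is worth monitoring. A minimal sketch, assuming the Linux bonding driver and its /proc/net/bonding/<bond> status file; the bond name bond0 is an assumption.

```python
#!/usr/bin/env python
"""Sketch: check slave link status of a Linux bonded interface.

Assumes the bonding driver's /proc/net/bonding/<bond> status file; the bond
name 'bond0' is an assumption for illustration.
"""
import sys

def failed_slaves(bond="bond0"):
    """Return slave interfaces whose MII status is not 'up'."""
    failed, slave = [], None
    with open("/proc/net/bonding/%s" % bond) as fh:
        for line in fh:
            if line.startswith("Slave Interface:"):
                slave = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and slave is not None:
                if line.split(":", 1)[1].strip() != "up":
                    failed.append(slave)
                slave = None
    return failed

if __name__ == "__main__":
    down = failed_slaves()
    if down:
        print("WARNING: bond0 slaves down: %s" % ", ".join(down))
        sys.exit(1)
    print("OK: all bond0 slaves up")
```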

Software

Software services should be designed to be resilient and to be provided by multiple hosts and at distributed locations.

“This is the Grid – it’s distributed. If the services aren’t distributable, rewrite them.” – anon

Monitoring

• If it can be monitored…
• Look for and restart failed service daemons (see the sketch below)
• Look for signatures of impending problems to predict component failure
• Idle disks hide their faults
  – Regular low-level verification runs to push sick drives over the edge
  – Replace early in the failure cycle
    • So it doesn’t fail during a rebuild…
• Increased error rates on network links from failing line cards, transceivers or cable/fibre degradation
  – If you have redundant links, you can replace the faulty one and keep the service going
• Call-out system for problems that impact services
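The "restart failed service daemons" item can be as simple as a cron-driven watchdog. A minimal sketch, assuming a SysV-style init script whose status action returns non-zero when the daemon is down; the service name is purely illustrative.

```python
#!/usr/bin/env python
"""Sketch: restart a service daemon if its init-script status check fails.

Assumes a SysV-style init script ('status' returns non-zero when the daemon
is down); the service name is illustrative, not a real Tier1 service name.
"""
import subprocess
import syslog

SERVICE = "example-daemon"  # hypothetical service name
INIT_SCRIPT = "/etc/init.d/%s" % SERVICE

def is_running():
    """True if the init script reports the daemon as running."""
    return subprocess.call([INIT_SCRIPT, "status"]) == 0

if __name__ == "__main__":
    if not is_running():
        syslog.syslog(syslog.LOG_WARNING,
                      "%s not running - attempting restart" % SERVICE)
        subprocess.call([INIT_SCRIPT, "restart"])
```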

Multiple hosts

• Services can be provided by more than one host if the application supports it
  – Share the load and increase performance
  – If one host fails, the rest provide the service
  – Use DNS round-robin to ‘randomly’ select a host using a service alias with a short TTL (see the sketch below)
  – Take broken host(s) out of active DNS
  – Avoid single points of failure
• Can locate multiple hosts…
  – … in different rooms
  – … in different buildings
  – … at different sites
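From the application's point of view a round-robin alias simply resolves to several addresses; removing a broken host from the record takes it out of rotation once the short TTL expires. A sketch of the client side, with a hypothetical alias name:

```python
#!/usr/bin/env python
"""Sketch: resolve a DNS round-robin service alias and pick one host.

The alias 'service.example.org' is hypothetical; a real deployment would use
the site's published service alias with a short TTL.
"""
import random
import socket

def pick_service_host(alias="service.example.org"):
    """Resolve the alias to all its A records and choose one at random."""
    name, aliases, addresses = socket.gethostbyname_ex(alias)
    if not addresses:
        raise RuntimeError("no addresses for %s" % alias)
    return random.choice(addresses)

if __name__ == "__main__":
    print("using host %s" % pick_service_host())
```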

Tier1

Resilience steps at the Tier1…

Hardware at the Tier1

• Most of the hardware techniques are used at the Tier1
• Bulk storage uses RAID1/5/6, ECC RAM, N+1 PSUs, multiple power feeds, regular verifies of arrays (scrubbing)
• Service nodes use RAID1, ECC RAM, some with N+1 PSUs
• Databases: RAID1/10/5/6, ECC RAM, N+1 PSUs, dual FC links, multiple power feeds
• Networking: redundant off-site link to SJ5 – working on redundancy (failover/backup) for the OPN link to CERN
• UPS (in the new building)
  – 24/7 UPS for critical services / database racks
  – Short-lived UPS for storage systems to allow clean shutdown

CASTOR Service

(Diagram: a single CASTOR instance per experiment, e.g. the CMS SRM, in front of a shared CASTOR core (Neptune) and an Oracle RAC (Pluto) on an FC array. In general, and all in place for CMS: mirrored disks on the stager/LSF master and on the name server; LSF master and rmmaster; LSF licence.)

3D Services + LHCB LFC

(Diagram: LHCb LFC read-only replica on a single host, with fast kickstart and failover to CERN; 3D Oracle RAC on an FC array.)

FTS and General LFC

(Diagram: FTS with 5 web front ends in DNS round-robin, hot spare soon; one channel/VO agent host (RAID1); general LFC in DNS round-robin, currently a single host, second host planned for mid-September (work in progress, running late); DN-to-account mapping host with mirrored disks; Oracle back end on a RAID10 SAN, currently two independent servers, with work active to deploy a 3-server RAC.)

CE and Fabric

(Diagram: CEs ce02, ce03, ce04, ce05 – three CE doublets, one for each of ATLAS, CMS and LHCb; each CE has mirrored disks; Torque/Maui; /home file system on hardware RAID; NIS.)

CE/SRM instances


WMS and LB

WMS triplet, LB doublet
• Now:
  – lcgwms01 – LHC
  – lcgwms02 – everyone
  – lcgwms03 – non-LHC
• Developments:
  – lcgwms01 – LHC
  – lcgwms02 – LHC
  – lcgwms03 – non-LHC
• All WMS use both LB systems

Other Tier1 Services

• UK-BDII:
  – DNS R-R triplet of simple hosts
  – Copes with load, provides resilience
  – Easy kickstart for rapid instancing
• RGMA registry:
  – Single host, RAID disks, easy kickstart
• MONbox:
  – Single host, RAID disks, easy kickstart
• VO boxes:
  – Several x single host, easy kickstart
• Site BDII:
  – DNS R-R doublet of simple hosts (same as UK-BDII)
• PROXY:
  – Doublet of simple hosts, easy kickstart
• GOCDB:
  – Internal failover with an alternative (Oracle) database, plus external failover to another web front end in Germany and a mirrored database in Italy; the latter is still being tested

• APEL:
  – Has a warm standby; new hardware is being purchased


Tier1 Monitoring

• Catch problems early with Nagios where possible (or at least catch problems before anyone notices)
  – Load alarms
  – File systems near to full
  – Certificates close to expiry (see the sketch below)
  – Failed drives
• Some Ganglia/Cacti capacity planning reviews (but ad hoc) looking for long-term trends
• Service Operations team making a difference
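Checks like "certificates close to expiry" fit naturally as Nagios-style plugins that exit 0/1/2 for OK/WARNING/CRITICAL. A sketch of one such check, assuming the openssl command line tool and the conventional host certificate path; the thresholds are illustrative.

```python
#!/usr/bin/env python
"""Sketch: Nagios-style check for an X.509 host certificate nearing expiry.

Assumes the 'openssl' command line tool is available; the certificate path
and the warning/critical thresholds are illustrative assumptions.
"""
import calendar
import subprocess
import sys
import time

CERT = "/etc/grid-security/hostcert.pem"  # conventional grid host cert path
WARN_DAYS, CRIT_DAYS = 30, 7              # illustrative thresholds

def days_until_expiry(cert_path):
    """Return the number of days until the certificate's notAfter date."""
    out = subprocess.Popen(
        ["openssl", "x509", "-in", cert_path, "-noout", "-enddate"],
        stdout=subprocess.PIPE).communicate()[0]
    # openssl prints e.g.: notAfter=Apr  2 12:00:00 2010 GMT
    enddate = out.decode().strip().split("=", 1)[1]
    expiry = calendar.timegm(time.strptime(enddate, "%b %d %H:%M:%S %Y %Z"))
    return (expiry - time.time()) / 86400.0

if __name__ == "__main__":
    days = days_until_expiry(CERT)
    if days < CRIT_DAYS:
        print("CRITICAL: certificate expires in %.1f days" % days)
        sys.exit(2)
    elif days < WARN_DAYS:
        print("WARNING: certificate expires in %.1f days" % days)
        sys.exit(1)
    print("OK: certificate valid for another %.1f days" % days)
    sys.exit(0)
```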


Tier1 Backups

• Critical hosts all backed up to the tape store
• Tape details written to the central loggers (see the sketch below)
  – So we can find which tape numbers to restore if the host is toast
• Speedy restores to toasted systems
• Verify and exercise backups…
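Writing the tape details to the central loggers is what makes a restore possible when the host itself is dead. A sketch of that idea, assuming a central syslog host reachable over UDP; the hostname and message format are hypothetical.

```python
#!/usr/bin/env python
"""Sketch: record which tape a host's backup went to on a central syslog host.

The central logger hostname and the message format are hypothetical; the
point is that the tape number survives even if the backed-up host is lost.
"""
import logging
import logging.handlers

CENTRAL_LOGGER = ("loghost.example.org", 514)  # assumed central syslog host

log = logging.getLogger("backup")
log.setLevel(logging.INFO)
log.addHandler(logging.handlers.SysLogHandler(address=CENTRAL_LOGGER))

def record_backup(hostname, tape_label, dump_level):
    """Log enough detail to find the right tape for a bare-metal restore."""
    log.info("backup host=%s tape=%s level=%d", hostname, tape_label, dump_level)

# Example usage with made-up values:
record_backup("serviceshost01", "TAPE012345", 0)
```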

Tier1 On-call

• A good driver for service improvement.

• Continuous improvement process with weekly review of night-time incidents
• The review drives:
  – Auto-restarters (team still not 100% keen)
  – Improved monitoring (more plugins)
  – Better response documentation
  – Changes to processes
• Also runs in the daytime
• Gradually, routine operations will become more and more the responsibility of the service intervention team
• CASTOR team carry out a “weekly” detailed review of all incidents, looking at how to avoid them in future; this will be generalised to the whole Tier1

Tier1 People

• Several teams with some degree of expertise sharing within each team
  – Fabric, Grid/Support, CASTOR, Databases
  – This has been pretty successful and we are reasonably confident we can handle tractable problems without the specialist present
• As far as is reasonably practicable we seek to ensure leave is scheduled to provide expert cover – not always possible
• On-call is also spreading expertise in critical services (e.g. even the Facility Manager knows how to restart the CASTOR request handler!)
• Able to call upon RAL Tier-2 staff (or others in GridPP or elsewhere) in case of a complete lack of expertise; we have done this occasionally and should probably be prepared to do it more often


Off Site services

• A few critical services are candidates for off-site replication; others such as the BDIIs and the LHCb LFC are already federated
• Possible candidates: FTS and the general LFC (possibly RGMA)
  – Both essential to GridPP
  – LFC based on Oracle
    • Streaming technology already deployed and tested elsewhere (3D)
  – RAL could operate these remotely, but the existing configuration is very expensive (£40K of hardware plus Oracle licences). Failover to new DNS names would also need to be site resilient (not trivial). May be worth exploring with nearby sites or Daresbury

Questions

To Andrew, please…!
