RAL Site Report
Martin Bly
HEPiX @ SLAC – 11-13 October 2005
Overview
• Intro
• Hardware
• OS/Software
• Services
• Issues
RAL T1
• Rutherford Appleton Lab hosts the UK LCG Tier-1
– Funded via GridPP project from PPARC
– Supports LCG and UK Particle Physics users
• VOs:
– LCG: Atlas, CMS, LHCb, (Alice), dteam
– Babar
– CDF, D0, H1, Zeus
– Bio, Pheno
• Expts:
– Minos, Mice, SNO, UKQCD
• Theory users
•…
Tier 1 Hardware
• ~950 CPUs in batch service
– 1.4GHz, 2.66GHz, 2.8GHz – P3 and P4/Xeon (HT off)
– 1.0GHz systems retiring as they fail, phase out end Oct '05
– New procurement
• Aiming for 1400+ SPECint2000/CPU
• Systems for testing as part of evaluation of tender
• First delivery early '06, second delivery in April/May '06
• ~40 systems for services (FEs, RB, CE, LCG servers, loggers etc)
• 60+ disk servers
– Mostly SCSI-attached IDE or SATA, ~220TB unformatted
– New procurement: probably PCI/SATA solution
• Tape robot
– 6K slots, 1.2PB, 10 drives
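For scale, the quoted robot figures work out to roughly 200 GB per slot (simple division of the numbers above, not a figure from the slide):

    \[
    \frac{1.2\ \mathrm{PB}}{6000\ \mathrm{slots}} = \frac{1\,200\,000\ \mathrm{GB}}{6000} \approx 200\ \mathrm{GB\ per\ slot}
    \]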
Tape Robot / Data Store
• Current data: 300TB; PP accounts for 200+TB (110TB Babar)
• Castor 1 system trials
– Many CERN-specifics
• HSM (Hierarchical Storage Manager)
– 500TB, DMF (Data Management Facility)
• SCSI/FC
• Real file system
• Data migrates to tape after inactivity
• Not for PP data
– Due November 05
• Procurement for a new robot underway
– 3PB, ~10 tape drives
– Expect to order end Oct 05
– Delivery December 05
– In service by March 06 (for SC4)
– Castor system
Networking
• Tier-1 backbone at 4x1Gb/s
– Upgrading some links to 10Gb/s
• Multi-port 10Gb/s layer-2 switch stack as hub when available
• 1Gb/s production link Tier-1 to RAL site
• 1Gb/s link to SJ4 (internet)
– 1Gb/s HW firewall
• Upgrade site backbone to 10Gb/s expected late '05, early '06
– Link Tier-1 to site at 10Gb/s – possible mid-2006
– Link site to SJ5 @ 10Gb/s – mid '06
• Site firewall remains an issue – limit 4Gb/s
• 2x1Gb/s link to UKLight
– Separate development network in UK
– Links to CERN @ 2Gb/s, Lancaster @ 1Gb/s (pending)
– Managed ~90MB/s during SC2, less since
• Problems with small packet loss causing traffic limitations (see the worked example at the end of this slide)
– Tier-1 to UKLight upgrade to 4x1Gb/s pending, 10Gb/s possible
– UKLight link to CERN requested @ 4Gb/s for early '06
– Over-running hardware upgrade (4 days expanded to 7 weeks)
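As an illustration of why even small packet loss throttles these links (a rough worked example with assumed numbers, not measurements from the slide), the Mathis et al. estimate for a single TCP stream is

    \[
    \mathrm{BW} \;\lesssim\; \frac{\mathrm{MSS}}{\mathrm{RTT}} \cdot \frac{1.22}{\sqrt{p}}
    \]

With MSS = 1460 bytes, an assumed RAL–CERN RTT of ~20 ms and a loss rate of 1 in 10,000 (p = 10^-4), a single stream is capped at roughly 70 Mb/s, far short of the 2x1Gb/s UKLight capacity.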
Tier1 Network Core – SC3
[Diagram: the SC3 Tier-1 network core. dCache pools, gridftp servers, ADS caches and non-SC hosts hang off switches 5510-1, 5510-2, 7i-1 and 7i-3 at N x 1Gb/s, with 4x 1Gb/s interconnects; a router/firewall carries 1Gb/s to the RAL site and to SJ4; the UKLight router provides 2x 1Gb/s to CERN and 290Mb/s to Lancaster.]
OS/Software
• Main services:
– Batch, FEs, CE, RB… : SL3 (3.0.3, 3.0.4, 3.0.5)
• LCG 2_6_0
• Torque/MAUI
• 1 Job/CPU
– Disk: RH72 custom, RH73 custom
– Some internal services on SL4 (loggers)
– Project to use SL4.n for disk servers underway
• Solaris disk servers decommissioned
– Most hardware sold
• AFS on AIX
– Transarc
– Project to move to Linux (SL3/4)
Services (1) - Objyserv
• Objyserv database service (Babar)
– Old service on traditional NFS server
• Custom NFS, heavily loaded, unable to cope with increased activity on the batch farm due to threading issues in the server
• Additional server solution with same technology not tenable
– New service:
• Twin ams-based servers, 2 CPUs, HT on, 2 GB RAM
• SL3, RAID1 data disks
• 4 servers per host system
– Internal redirection using iptables to different server ports depending on which of the 4 IP addresses is used to make the connection (see the sketch at the end of this slide)
• Able to cope with some ease: 600+ clients
• Contact Chris Brew
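A minimal sketch of the kind of iptables redirection described above, issued here from Python; the IP addresses and port numbers are hypothetical placeholders, not the actual RAL configuration:

    #!/usr/bin/env python3
    # Sketch: map each of four service IP addresses to a different local ams
    # server port with iptables REDIRECT rules (illustrative values only;
    # must run as root).
    import subprocess

    SERVICE_PORT = 5000          # hypothetical port the clients connect to
    REDIRECT_MAP = {             # hypothetical service IPs -> local server ports
        "192.0.2.1": 5001,
        "192.0.2.2": 5002,
        "192.0.2.3": 5003,
        "192.0.2.4": 5004,
    }

    for ip, port in REDIRECT_MAP.items():
        subprocess.run(
            ["iptables", "-t", "nat", "-A", "PREROUTING",
             "-d", ip, "-p", "tcp", "--dport", str(SERVICE_PORT),
             "-j", "REDIRECT", "--to-ports", str(port)],
            check=True,
        )

Each of the four server processes then listens on its own port on the host, consistent with the "4 servers per host system" arrangement above, while clients keep targeting a single well-known port on one of the four addresses.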
Services (2) – Home file system
• Home file system migration
– Old system:
• ~85GB on A1000 RAID array
• Sun Ultra10, Solaris 2.6, 100Mb/s NIC
• Failed to cope with some forms of pathological use 
– New system:
• ~270GB SCSI RAID5, 6 disk chassis
• 2.4GHz Xeon, 1GB RAM, 1Gb/s NIC
• SL3, ext3
• Stable under I/O and quota testing, and during backup
– Migration:
• 3 weeks planning
• 1 week of nightly rsync followed by checksumming
– Convince ourselves the rsync works (see the sketch at the end of this slide)
• 1 day farm shutdown to migrate
• A single file detected with a checksum error
– Quotas for users unchanged…
– Keep the old system on standby to restore its backups
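A minimal sketch of the kind of post-rsync checksum comparison described above; the paths and the choice of MD5 are assumptions, not details from the slide:

    #!/usr/bin/env python3
    # Sketch: compare checksums of every file under the old and new home areas
    # after an rsync, reporting any missing or mismatched files (placeholder paths).
    import hashlib
    import os

    OLD_ROOT = "/old/home"   # hypothetical mount of the old Solaris-served area
    NEW_ROOT = "/new/home"   # hypothetical mount of the new SL3 server

    def md5sum(path, bufsize=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    mismatches = 0
    for dirpath, _dirs, files in os.walk(OLD_ROOT):
        for name in files:
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(NEW_ROOT, os.path.relpath(old_path, OLD_ROOT))
            if not os.path.exists(new_path):
                print("MISSING", new_path)
                mismatches += 1
            elif md5sum(old_path) != md5sum(new_path):
                print("MISMATCH", old_path)
                mismatches += 1

    print(mismatches, "problem files found")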
Services (3) – Batch Server
• Catastrophic disk failure on Saturday late evening over a holiday weekend
– Staff not expected back till 8:30am Wednesday
• Problem noted Tuesday morning
– Initial inspection - disk a total failure
– No easy access to backups
• Backup tape numbers in logs on failed disk!
– No easy recovery solution with no other system staff available
– Jobs appear happy – terminating OK, sending sandboxes to gatekeeper etc. But no accounting data, no new jobs started.
• Wednesday:
– Hardware 'revised' with two disks, Software RAID1, clean install of SL3
– Backups located, batch/scheduling configs recovered from tape store
– System restarted with MAUI off to allow Torque to sort itself out
• Queues came up closed (see the sketch at the end of this slide)
– MAUI restarted
– Service picked up smoothly
• Lessons:
– Know where the backups are and how to identify which tapes are the right ones
– Unmodified batch workers are not good enough for system services
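A minimal sketch of re-opening Torque queues that come back closed after a pbs_server restart; the queue names are hypothetical and this is not the actual recovery procedure used at RAL:

    #!/usr/bin/env python3
    # Sketch: re-enable and start Torque queues after a pbs_server restart
    # (queue names are placeholders).
    import subprocess

    QUEUES = ["prod", "short"]   # hypothetical queue names

    for queue in QUEUES:
        for attribute in ("enabled", "started"):
            subprocess.run(
                ["qmgr", "-c", f"set queue {queue} {attribute} = true"],
                check=True,
            )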
Issues
• How to run resilient services on non-resilient hardware?
– Committed to run 24x365, 98%+ uptime
– Modified batch workers with extra disks and HS caddies as servers
– Investigating HA-Linux
• Batch server and scheduling experiments positive
• RB, CE, BDII, R-GMA …
– Databases
• Building services maintenance
– Aircon, power
• Already two substantial shutdowns in 2006
• New building
• UKLight is a development project network
– There have been problems with managing expectations for production services on a development network
• Unresolved packet loss in CERN-RAL transfers
– Under investigation
• 10Gb/s kit expensive
– Components we would like are not yet affordable/available
– Pushing against LCG turn-on date
Questions?