CASTOR 2.1.9 Upgrade, Testing and Issues
Shaun de Witt
GRIDPP-25, 23 August 2010

Agenda
• Testing – what we planned, what we did and what the VOs are doing
• Results
• Issues
• Rollout plan
• The future

Planned Testing
• Original plan
  – Test the database upgrade procedure
  – Functional test 2.1.7/8/9
  – Stress test 2.1.7/8/9
    • 10K reads (1 file in, multiple reads) (rfio + gridFTP)
    • 10K writes (multiple files in) (rfio + gsiftp)
    • 10K disk-to-disk copies (1 file in, multiple reads) (rfio)
    • 20K read/write (rfio + gridFTP), 10K mixed tests
    • 10K stager_qry (database test)
    • 5 file sizes (100 MB – 2 GB)

Required Changes
• Move to a 'local' nameserver
  – Required to allow rolling updates
    • The nameserver schema cannot be upgraded until all instances are at 2.1.9
• Move from SLC4 to SL4
  – Support for SLC4 ends this year
    • SL4 is supported until 2012
• Change of disk servers part way through testing

Actual Testing

  Stager     Local Nameserver   Central Nameserver   Tests
  2.1.7-27   -                  2.1.8-3              FT/ST
  2.1.7-27   2.1.7-27           2.1.8-3              FT
  2.1.7-27   2.1.7-27           2.1.8-18             FT
  2.1.8-18   2.1.8-18           2.1.8-18             FT/ST†
  2.1.8-18   2.1.8-18           2.1.9-6*             FT
  2.1.9-6    2.1.9-6            2.1.9-6*             FT
  2.1.9-6    -                  2.1.9-6              FT/ST

• (*) Indicates a schema-only upgrade; the RPMs remained at the previous version
• (†) Move from SLC4 to SL4 after stress testing

Actual Stress Testing
• Running the originally planned number of operations would have taken too long
  – Moved to fixed-duration testing (24 hr limit; a minimal sketch of such a driver is given after the "Issues (VO Testing)" slide below)
  – Reduced the number of file sizes from 5 to 2
    • 100 MB and 2 GB
• No mixed tests

Results
• All 2.1.8 functional tests pass
• Most 2.1.9 tests pass
  – With some modifications to the scripts
  – Including xrootd!
  – Some fail because they require a CERN-specific set-up
• Stable under stress testing
  – The changes made the performance metrics less useful
  – Overall impression is no significant change

Issues (on Testing)
• Limit on clients
  – More stress on the client machines than on CASTOR
  – Unable to test extreme LSF queues
  – VO testing includes stress (HammerCloud) tests
• Functional tests done with a 'matching' client version
  – Some basic testing also done with older client versions (2.1.7) against later stager versions
  – VOs are using 2.1.7 clients

Issues (on CASTOR)
• Remarkably few...
  – DLF not registering the file id
    • Fixed by CERN – we need a custom version of DLF.py
  – No 32-bit xroot RPMs available
    • Produced for us, but not fully supported
  – gridFTP external (used at RAL) does not support checksumming
  – Some database clean-up needed before the upgrade

Issues (VO Testing)
• Some misconfigured disk servers
• Problems with xrootd for ALICE
  – Disk servers need firewall ports opening
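For reference, the following is a minimal sketch, in Python, of what a fixed-duration rfio stress driver of the kind described on the "Planned Testing" and "Actual Stress Testing" slides could look like. The stager host, service class, CASTOR path and the exact rfcp/stager_qry invocations are illustrative assumptions only; the actual RAL test scripts are not shown in these slides.

    #!/usr/bin/env python
    # Minimal sketch of a fixed-duration rfio stress driver (24 hr limit).
    # All hosts, paths and service classes below are hypothetical.
    import os
    import subprocess
    import time
    import uuid

    DURATION = 24 * 3600                                  # fixed-duration run: 24 hours
    LOCAL_FILE = "/tmp/testfile_100MB"                    # pre-generated 100 MB (or 2 GB) source file
    CASTOR_DIR = "/castor/example.rl.ac.uk/stresstest"    # hypothetical CASTOR directory

    os.environ.setdefault("STAGE_HOST", "stager.example.rl.ac.uk")  # hypothetical stager host
    os.environ.setdefault("STAGE_SVCCLASS", "testSvcClass")         # hypothetical service class

    def timed(cmd):
        """Run a client command, returning (elapsed seconds, return code)."""
        start = time.time()
        rc = subprocess.call(cmd)
        return time.time() - start, rc

    write_times, read_times = [], []
    deadline = time.time() + DURATION

    while time.time() < deadline:
        remote = "%s/%s" % (CASTOR_DIR, uuid.uuid4().hex)

        # write: copy the local file into CASTOR over rfio
        elapsed, rc = timed(["rfcp", LOCAL_FILE, remote])
        if rc != 0:
            continue
        write_times.append(elapsed)

        # read it back, mirroring the "1 file in, multiple reads" pattern
        elapsed, rc = timed(["rfcp", remote, "/dev/null"])
        if rc == 0:
            read_times.append(elapsed)

        # exercise the stager database, as in the 10K stager_qry test
        subprocess.call(["stager_qry", "-M", remote])

    print("completed %d writes and %d reads" % (len(write_times), len(read_times)))

A gridFTP variant would replace the rfcp calls with the corresponding globus-url-copy transfers to gsiftp URLs; again, that tooling choice is an assumption rather than a description of the scripts actually used.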
Issues (in 2.1.9-6)
• Known issues affecting 2.1.9-6
  – Rare checksum bug affecting gridFTP internal
    • Fixed in 2.1.9-8
  – Can get file inconsistencies during repack if a file is overwritten
    • Very unlikely (fixed in 2.1.9-7)
  – xrootd manager core dumps at CERN
    • Under investigation
  – Problem with multiple tape copies on file update

Change Control
• The whole testing and rollout plan has been extensively change-reviewed
  – Four separate reviews, some done independently of the CASTOR team
  – Included a review of the update process
  – Provided useful input for additional tests, highlighted limitations and identified impacted systems
  – Regular reviews proposed during the upgrades
• A detailed update plan is under development

Rollout Plan
• High-level documents have been available for some time:
  – https://www.gridpp.ac.uk/wiki/RAL_Tier1_Upgrade_Plan
• Three downtimes
• Schedule to be agreed with the VOs
  – Proposed schedule sent to the VOs
  – LHCb are likely to be the guinea pigs
  – ALICE before the heavy-ion run

Schedule (draft)
• Rolling move to the local nameserver starting 13/9
• Main update:
  – LHCb: 27/9
  – GEN (ALICE): 25/10
  – ATLAS: 8/11
  – CMS: 22/11
• Revert back to the central nameserver after Christmas

The Future
• More CASTOR/SRM upgrades
  – 2.1.9-8 to address the known issues
  – 2.9 SRM is more performant and safer against DoS
• Move to SL5
  – Probably next year; no RPMs available yet
• CASTOR gridFTP 'internal'
• More use of xrootd
• More stable database infrastructure (Q1 2011?)

Facilities Instance
• Provide a CASTOR instance for STFC facilities
  – Provides a (proven) massively scalable "back end" storage component of a deeper data-management architectural stack
  – CASTOR for STFC facilities: production system to be deployed ~Dec 2010
  – STFC friendly users are currently experimenting with CASTOR
  – Users are expected to interface to CASTOR via "Storage-D" (a high-performance data-management pipeline)
  – E-Science is aiming for a common architecture for "big data management":
    • CASTOR – back-end data storage
    • Storage-D – middleware
    • ICAT – file and metadata catalogue
    • TopCat – multi-user web access
  – Can eventually wind down the sterling (but obscure) "ADS" service (very limited expertise, non-Linux operating system, unknown code in many parts)
  – Exploits the current (and future) skill set of the group

Summary
• The new CASTOR was stable under stress testing
  – And under VO testing – so far
• Performance not impacted – probably
• Very useful getting the experiments on board for testing
• 'Ready' for deployment

Results (Stress Tests, 100 MB)

  Test                 2.1.7               2.1.8              2.1.9
  rfio write           76.3 (+/-3.92) s    82.7 (+/-25.7) s   39.7 (+/-22.3) s
  rfio write+read      330.3 (+/-107.1) s  10.5 (+/-24.8) s   62.4 (+/-18.0) s
  disk-to-disk         59.7 (+/-10.9) s    23.2 (+/-14.5) s   68.6 (+/-17.4) s
  gridFTP write        85.4 (+/-10.5) s    43.3 (+/-14.2) s   49.9 (+/-73.0) s
  gridFTP write+read   27.9 (+/-7.4) s     68.9 (+/-18.9) s   72.5 (+/-40.8) s

Results (Stress Tests, 2 GB)

  Test                 2.1.7                2.1.8              2.1.9
  rfio write           16944.3 (+/-286.4)   1699.8 (+/-42.7)   736.8 (+/-377.6)
  rfio write+read      3409.9 (+/-9.6)      380.6 (+/-168.7)   1421.8 (+/-597.7)
  disk-to-disk         7605.3 (+/-2317.7)   402.9 (+/-175.6)   1295.9 (+/-597.7)
  gridFTP write        1713.8 (+/-19.8)     765.1 (+/-83.2)    750.5 (+/-223.0)
  gridFTP write+read   1630.3 (+/-184.5)    803.9 (+/-220.2)   1287.3 (+/-638.0)
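Each cell in the two tables above quotes a mean transfer time and a spread. As a small illustration, and assuming the spread is the sample standard deviation of the individual transfer times (the slides do not state which estimator was used), one cell could be produced as follows; the timing values are made up.

    # Sketch: turn a list of per-transfer timings into a "mean(+/-spread)s"
    # entry like those in the stress-test tables. The sample data are made up
    # and the use of the sample standard deviation is an assumption.
    from statistics import mean, stdev

    timings = [41.2, 35.8, 39.7, 44.1, 37.5]   # hypothetical per-transfer times in seconds

    print("%.1f(+/-%.1f)s" % (mean(timings), stdev(timings)))
    # -> 39.7(+/-3.2)s for this made-up sample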