
Tier-1 Status
Andrew Sansum
GRIDPP20
12 March 2008
Tier-1 Capacity Delivered to WLCG (2007)

[Pie chart: Tier-1 CPU share by 2007 MoU – CERN 18%, BNL 14%, FNAL 10%, FZK 10%, ASGC 10%, SARA 9%, RAL 7%, IN2P3 7%, CNAF 7%, NDGF 4%, PIC 3%, TRIUMF 1%]
CPU Use by VO (2007)

[Chart: CPU/wall time delivered at RAL by VO – ATLAS, ALICE, CMS, LHCb]
Experiment Shares (2008)

[Pie chart: approximate hardware value allocated to experiments in 2008 – ATLAS 53%, CMS 31%, LHCb 8%, ALICE 4%, BaBar 3%, Other 1%]
Grid Only
• Non-Grid access to the Tier-1 has now ended. Only special cases (contact us if you believe you are one) still have access to:
  – UIs
  – Job submission
• Until the end of May 2008:
  – IDs will be maintained (disabled)
  – Home directories will be maintained online
  – Mail forwarding will be maintained
• After the end of May 2008:
  – IDs will be deleted
  – The home filesystem will be backed up
  – The mail spool will be backed up
  – Mail forwarding will stop
• The AFS service continues for BaBar (and just in case)
Reliability

[Chart: RAL-LCG2 availability/reliability by month, May 2006 to February 2008 – series: Available, Old Reliability, New Reliability, Target, Average, Best 8]
• February: mainly due to the power failure plus 8 hours of network downtime
• December/January: mainly CASTOR problems over the Christmas period (despite multiple callouts)
• Out-of-hours on-call will help, but some problems take time to diagnose/fix
Power Failure: Thursday 7th February, 13:00
• Work on the power supply since December:
  – Down to 1 transformer (from 2) for extended periods (weeks), with increased risk of disaster
  – The single transformer was running at its maximum operating load
  – No problems until the work finished and the casing was closed – a control line was crushed and the power supply tripped
  – Total loss of power to the whole building
• First power interruption for over 3 years
• Restart (effort > 200 FTE hours):
  – Most global/national/Tier-1 core systems up by Thursday evening
  – Most of CASTOR and part of the batch farm up by Friday
  – Remaining batch capacity back on Saturday
  – Still problems to iron out in CASTOR on Monday/Tuesday
• Lessons:
  – Communication was prompt and sufficient, but ad hoc
  – Broadcasts were unavailable because RAL runs the GOCDB (now fixed by caching)
  – Careful restart of disk servers was slow and labour intensive; it worked, but will not scale
See: http://www.gridpp.rl.ac.uk/blog/2008/02/18/review-of-the-recent-power-failure/
Hardware: Disk
• Production capacity: 138 servers, 2800 drives, 850TB (usable)
• 1.6PB of capacity delivered in January by Viglen:
  – 91 Supermicro 3U servers with dual AMD 2220E (2.8GHz) dual-core CPUs, 8GB RAM, IPMI
    • 1 x 3ware 9650 4-port PCIe RAID controller with 2 x 250GB WD HDD
    • 1 x 3ware 9650 16-port PCIe RAID controller with 14 x 750GB WD HDD
  – 91 Supermicro 3U servers with dual Intel E5310 (1.6GHz) quad-core CPUs, 8GB RAM, IPMI
    • 1 x 3ware 9650 4-port PCIe RAID controller with 2 x 400GB Seagate HDD
    • 1 x 3ware 9650 16-port PCIe RAID controller with 14 x 750GB Seagate HDD
• Acceptance testing running – scheduled to be available by the end of March
  – 5400 spinning drives after the planned phase-out in April (expect a drive failure every 3 days; rough arithmetic below)
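A rough sanity check on the quoted failure interval (the annualised failure rate used here is an assumption for illustration, not a figure from the slide): with an annualised drive failure rate of about 2%,

  5400 drives × 0.02 / year ≈ 108 failures / year ≈ one failure every 3–4 days.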
Hardware: CPU
• Production: about 1500 KSI2K on 600 systems
  – Recently upgraded about 50% of capacity to 2GB/core
• Recent procurement (approximately 3000 KSI2K – but YMMV) delivered and under test:
  – Streamline
    • 57 x 1U servers (114 systems, 3 racks), each system: dual Intel E5410 (2.33GHz) quad-core CPUs, 2GB/core, 1 x 500GB HDD
  – Clustervision
    • 56 x 1U servers (112 systems, 4 racks), each system: dual Intel E5440 (2.83GHz) quad-core CPUs, 2GB/core, 1 x 500GB HDD
Hardware: Tape
• Tape drives
  – 8 9940B drives
    • Used on the legacy ADS/dCache service – to be phased out soon
  – 18 T10K tape drives and associated servers delivered; 15 in production, the remainder soon
    • Planned bandwidth: 50MB/s per drive
    • Actual bandwidth: 8-80MB/s – a work in progress
• Media
  – Approximately 2PB on site
Hardware: Network

[Diagram: Tier-1 network. Stacks of Nortel switches (2 x 5510 + 5530, 3 x 5510 + 5530, 5 x 5510 + 5530, 6 x 5510 + 5530, and a 5510/5530 pair) aggregate the CPU + disk farms, the ADS caches and the Oracle systems at N x 1Gb/s into a central Force10 C300 8-slot router plus a stack of 4 x Nortel 5530 (64 x 10Gb ports). 10Gb/s links run to RAL site Router A (serving the RAL site and the RAL Tier 2), to the OPN router (10Gb/s to CERN), and, via a 10Gb/s bypass around the Tier-1 firewall, to the site access router (10Gb/s to SJ5); a 1Gb/s link goes to Lancaster (test network). Legend: links implemented / implement soon / never.]
Backplane Failures (Supermicro)
• 3 servers “burnt out” a backplane
• 2 of these set off the VESDA smoke-detection system
• 1 called out the fire brigade
• Safety risk assessment: urgent rectification needed
• Good response from the supplier/manufacturer
• PCB fault in a “bad batch”
• Replacement nearly complete
Machine Rooms
• Existing machine room
  – Approximately 100 racks of equipment
  – Getting close to power/cooling capacity
• New machine room
  – Work still proceeding close to schedule
  – 800 m² can accommodate 300 racks + 5 robots
  – 2.3MW power/cooling capacity (some UPS)
  – Scheduled to be available for September 2008
CASTOR Memory Lane

[Timeline, 4Q05 to 1Q08, annotated with milestones in rough chronological order:]
• 4Q05: CASTOR1 tests OK
• Hard to install, plus dependencies
• Problems with functionality and performance – it doesn't work!
• CASTOR2 core running
• CMS on CASTOR for CSA06 – encouraging
• 2.1.2 bad
• ATLAS on CASTOR
• Service stopped for extended upgrade
• Production service declared
• 2.1.3 good but missing functionality
• CSA07 encouraging; OC committees note improvement but remain concerned
• 2.1.4 upgrade goes well – Disk1 support!
• LHCb on CASTOR
• CSA08 reasonably successful – happy days!
Growth in Use of CASTOR

[Chart: growth in use of CASTOR over time]
Test Architecture

[Diagram: CASTOR test set-ups.
• Preproduction and Development share a name server + vmgr node (Oracle NS + vmgr database) and a tape server.
  – Preproduction: stager, DLF, LSF and repack daemons with Oracle stager, DLF and repack databases; 1 disk server (variable).
  – Development: stager and DLF + LSF daemons with Oracle stager and DLF databases; 1 disk server (variable).
• Certification testbed: its own shared services (name server + vmgr, Oracle NS + vmgr), stager, DLF, LSF and repack daemons with Oracle stager, DLF and repack databases, a tape server and 1 disk server (variable).]
CASTOR Production Architecture

[Diagram: shared services comprise two name server nodes (Name Server 1 + vmgr, Name Server 2) backed by an Oracle NS + vmgr database, plus a pool of tape servers. Each of CMS, ATLAS and LHCb has its own stager instance – stager, DLF and LSF daemons with dedicated Oracle stager and DLF databases and its own disk servers. A further instance serves repack and small-user stagers (stager, DLF, LSF and repack daemons, Oracle stager, DLF and repack databases, 1 disk server).]
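As a compact restatement of the production layout above, the sketch below records the split of shared services and per-VO stager instances as a Python data structure; the labels are schematic, not the real RAL node or database names.

#!/usr/bin/env python
# Schematic summary of the CASTOR production architecture described above.
# All names are illustrative labels, not actual RAL hostnames.
PRODUCTION_LAYOUT = {
    "shared_services": {
        "name_servers": ["nameserver1 (+vmgr)", "nameserver2"],
        "oracle_databases": ["NS + vmgr"],
        "tape_servers": "shared pool serving all stager instances",
    },
    "stager_instances": {
        # One instance per LHC VO: stager, DLF and LSF daemons with
        # dedicated Oracle stager/DLF databases and their own disk servers.
        "CMS":   {"daemons": ["stager", "DLF", "LSF"],
                  "oracle_databases": ["stager", "DLF"],
                  "disk_servers": "dedicated pool"},
        "ATLAS": {"daemons": ["stager", "DLF", "LSF"],
                  "oracle_databases": ["stager", "DLF"],
                  "disk_servers": "dedicated pool"},
        "LHCb":  {"daemons": ["stager", "DLF", "LSF"],
                  "oracle_databases": ["stager", "DLF"],
                  "disk_servers": "dedicated pool"},
        # Repack and small-user stager instance with a single disk server.
        "repack_and_small_VOs": {"daemons": ["stager", "DLF", "LSF", "repack"],
                                 "oracle_databases": ["stager", "DLF", "repack"],
                                 "disk_servers": 1},
    },
}

for name, instance in PRODUCTION_LAYOUT["stager_instances"].items():
    print("%s instance: %s" % (name, ", ".join(instance["daemons"])))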
Atlas Data Flow Model

[Diagram: ATLAS data flows between the Tier-0, the RAL farm, the other Tier-1s, a partner Tier-1 and the Tier-2s, across service classes labelled D0T1 (T0Raw), D1T0, D1T1 and D0T0 (DnTm denotes n disk copies and m tape copies, so D1T0 is disk-only and D0T1 is tape-backed with a disk cache). Data types shown include RAW, ESD1, ESD2, AOD2, AODm1, AODm2, TAG, simRaw, simStrip and StripInput (ESD/AODm/TAG).]
CMS Dataflow

[Diagram: CMS pools at RAL – all pools are disk0tape1.
• WanIn (8 LSF slots per server): transfers in from the T0, T1s and T2s.
• FarmRead (50 LSF slots per server): serves the batch farm, with tape recall and disk-to-disk copies to/from the WAN pools.
• WanOut (16 LSF slots per server): transfers out to the T1s and T2s.]
CMS Disk Server Tuning: CSA06/CSA07
• Problem: network performance too low
  – Increase default/maximum TCP window size
  – Increase TCP ring buffers and tx queue
  – Ext3 journal changed to data=writeback
• Problem: performance still too low
  – Reduce number of gridftp slots per server
  – Reduce number of streams per file
• Problem: PhEDEx transfers now time out
  – Reduce FTS slots to match disk pools
• Problem: servers sticky or crash with OOM
  – Limit total TCP buffer space
  – Protect low memory
  – Aggressive cache flushing (kernel settings sketched below)
See: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Disk_Server_Tuning
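A minimal sketch of where the kernel-level knobs above live, assuming a 2.6-era Linux disk server; the values are illustrative placeholders, not the settings actually used at RAL (the wiki page above has those):

#!/usr/bin/env python
# Illustrative only: shows where each tuning knob lives on a 2.6-era
# Linux disk server. The values are placeholders, not RAL's settings.
# Run as root.

SETTINGS = {
    # Larger default/maximum socket buffers for high-latency WAN transfers
    # (the "TCP window size" items above).
    "/proc/sys/net/core/rmem_max": "8388608",
    "/proc/sys/net/core/wmem_max": "8388608",
    "/proc/sys/net/ipv4/tcp_rmem": "4096 87380 8388608",   # min default max
    "/proc/sys/net/ipv4/tcp_wmem": "4096 65536 8388608",
    # Cap the total memory (in pages) the TCP stack may use, so many
    # concurrent gridftp streams cannot exhaust RAM ("limit total TCP
    # buffer space").
    "/proc/sys/net/ipv4/tcp_mem": "196608 262144 393216",
    # Start background write-out of dirty pages earlier ("aggressive
    # cache flushing").
    "/proc/sys/vm/dirty_background_ratio": "3",
    "/proc/sys/vm/dirty_ratio": "20",
}

for path, value in SETTINGS.items():
    with open(path, "w") as f:
        f.write(value + "\n")
    print("set %s = %s" % (path, value))

# NIC ring buffers and txqueuelen are changed with ethtool/ifconfig, and
# data=writeback is an ext3 mount option in /etc/fstab, not a sysctl.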
3Ware Write Throughput

[Chart: aggregate and per-thread write throughput (MB/s) versus thread count, 1 to 39 threads; y-axis 0 to 160 MB/s.]
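A hedged sketch of the kind of multi-threaded write benchmark that could produce a plot like the one above; the target directory, file size and thread counts are invented for illustration and were not taken from the talk:

#!/usr/bin/env python
# Sketch of a concurrent-writer benchmark: each thread streams a large
# file to the RAID volume, and aggregate / per-thread MB/s is reported
# for a range of thread counts.
import os
import threading
import time

TARGET_DIR = "/mnt/raid"        # hypothetical mount point of the 3ware array
FILE_SIZE_MB = 1024             # data written by each thread
BLOCK = b"\0" * (1024 * 1024)   # 1 MiB write block

def writer(index):
    path = os.path.join(TARGET_DIR, "bench_%d.dat" % index)
    with open(path, "wb") as f:
        for _ in range(FILE_SIZE_MB):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())    # include write-back time in the measurement
    os.remove(path)

def run(threads):
    start = time.time()
    workers = [threading.Thread(target=writer, args=(i,)) for i in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    elapsed = time.time() - start
    aggregate = threads * FILE_SIZE_MB / elapsed
    print("%2d threads: %6.1f MB/s aggregate, %6.1f MB/s per thread"
          % (threads, aggregate, aggregate / threads))

for n in (1, 2, 4, 8, 16, 32):
    run(n)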
CCRC08 Disk Server Tuning
• Migration rate to tape very bad (5-10MB/s) when concurrent with writing data to disk
  – Was OK in CSA06 (50MB/s per server) – Areca servers
  – 3ware 9550 performance terrible under concurrent read/write (2MB/s read, 120MB/s write)
  – 3ware appears to prioritise writes
• Tried many tweaks, most with little success, except (both options sketched below):
  – Either: changing the elevator to anticipatory
    • Downside – write throughput reduced
    • Good under benchmarking – testing in production this week
  – Or: increasing the block device read-ahead
    • Read throughput high but erratic under test
    • But seems OK in production (30MB/s per server)
See:
http://www.gridpp.rl.ac.uk/blog/2008/02/29/3ware-raid-controllers-and-tape-migration-rates/
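A minimal sketch of the two tweaks above, assuming a 2.6 kernel where the 3ware array appears as /dev/sda; the device name and the read-ahead value are illustrative, and in practice only one of the two options would be applied at a time:

#!/usr/bin/env python
# Run as root. Shows where the two knobs discussed above live in sysfs.
DEVICE = "sda"          # hypothetical block device backed by the 3ware controller
READ_AHEAD_KB = 8192    # illustrative value, not the RAL setting

# Option 1: switch the I/O elevator to 'anticipatory' (helps reads
# during heavy writes, at the cost of some write throughput).
with open("/sys/block/%s/queue/scheduler" % DEVICE, "w") as f:
    f.write("anticipatory\n")

# Option 2: raise the block-device read-ahead instead (read throughput
# higher but more erratic under test).
with open("/sys/block/%s/queue/read_ahead_kb" % DEVICE, "w") as f:
    f.write("%d\n" % READ_AHEAD_KB)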
CCRC (CMS – WanIn)

[Monitoring plots for the WanIn pool (scale ~300MB/s): network in, network out, migration queue, PhEDEx Tier-0 transfer rate, CPU.]
CCRC (WanOut)

[Monitoring plots for the WanOut pool (scale ~300MB/s): network in, network out, PhEDEx rates before and after replication, CPU.]
CASTOR Plans for May CCRC08
• Still problems
  – Optimising end-to-end transfer performance remains a balancing act
  – Hard to manage the complex configuration
• Working on
  – ALICE/xrootd deployment
  – Preparation for the 2.1.6 upgrade
  – Installation of Oracle RACs (resilient Oracle services for CASTOR)
  – Provisioning and configuration management
dCache Closure
• Agreed with the UB that we would give 6 months' notice before terminating the dCache service
  – dCache closure announced to the UB for May 2008
• ATLAS and LHCb are working to migrate their data
  – Migration slower than hoped
  – Service much reduced in size now (10-12 servers remain) and the operational overhead is much lower
• Migration of the remaining non-LHC experiments delayed by the low priority of non-CCRC work
  – Work on the Gen instance of CASTOR will recommence shortly
• Pragmatically, closure may be delayed by several months until Minos and the tiny VOs have migrated
Termination of GridPP Use of the ADS Service
• GridPP funding and use of the legacy Atlas Datastore (ADS) service is scheduled to end at the end of March 2008
  – No GridPP access via the “tape” command after this
  – Also no access via the C-callable VTP interface
• RAL will continue to operate the ADS service, and experiments are free to purchase capacity directly from the Datastore team
• Pragmatically, closure cannot happen until:
  – dCache ends (it uses the ADS back end)
  – CASTOR is available for small VOs
  – Probably 6 months away
Conclusions
• Hardware for the 2008 MoU is in the machine room and moving satisfactorily through acceptance
  – Volume not yet a problem, but warning signs are starting to appear
• The CASTOR situation continues to improve
  – Reliable during CCRC08
  – Hardware performance improving; the tape migration problem is reasonably well understood and partly solved, with scope for further improvement
  – Progressing various upgrades
• The remaining Tier-1 infrastructure is essentially problem free
• Availability fair, but stagnating; we need to progress:
  – Incident response staff
  – On-call
  – Disaster planning and national/global/cluster resilience
• Concerned that we still have not seen all experiment use cases