GridPP Deployment Status
GridPP15
11th January 2006
Jeremy Coles
[email protected]
Overview
1 An update on some of the high-level metrics
2 Some new sources of information
3 General deployment news
4 Expectations for the coming months
5 Preparing for SC4 and our status
6 Summary
Prototype metric report
UKI is still contributing well, but according to the SFT data our proportion of sites failing is relatively high
Snapshot of recent numbers:

Region                   # sites   Average CPU
Asia Pacific                   8           450
CERN                          19          4250
Central Europe                15            30
France                         8          1100
Germany & Switzerland         11          3365
Italy                         25          2250
Northern Europe               10          1160
Russia                        14           430
South East Europe             20           370
South West Europe             14           700
UKI                           36          3250
Most of the unavailable sites have
been in Ireland as they make the
move over to LCG 2.6.0.
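As a rough cross-check of the snapshot above (and of the ~20% contribution figure quoted later in this talk), the regional figures can simply be summed. The sketch below treats the "Average CPU" column as comparable units and computes the UKI fraction; it illustrates the arithmetic only.

```python
# Rough cross-check of the UKI share of EGEE CPU using the snapshot table above.
# Treating the "Average CPU" figures as comparable units, the UKI fraction comes
# out just under 20%, consistent with the contribution figure quoted later.

average_cpu = {
    "Asia Pacific": 450,
    "CERN": 4250,
    "Central Europe": 30,
    "France": 1100,
    "Germany & Switzerland": 3365,
    "Italy": 2250,
    "Northern Europe": 1160,
    "Russia": 430,
    "South East Europe": 370,
    "South West Europe": 700,
    "UKI": 3250,
}

total = sum(average_cpu.values())
uki_share = average_cpu["UKI"] / total
print(f"Total average CPU across regions: {total}")
print(f"UKI share: {uki_share:.1%}")   # roughly 18-19% for this snapshot
```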
[Chart: EGEE total job slots vs date, June 2004 – December 2005. Average job slots have increased gradually.]
[Chart: UK total job slots vs date, June 2004 – December 2005 (scale 0–20000).]
UK job slots have increased by about 10% since GridPP14. (See Steve Lloyd’s talk
for how this looks against the targets)
Therefore our contribution to EGEE CPU resources remains at ~20%
[Chart: UK % of total EGEE CPU (percentage contribution vs date), June 2004 – December 2005.]
However there still is not consistently high usage of job slots
[Chart: % job slots used vs date, June 2004 – November 2005, showing % EGEE slots used and % UK slots used (scale 0–90%).]
The largest GridPP users by VO for 2005: BABAR, ATLAS, BIOMED, CMS, DZERO, ZEUS, LHCb
NB: Excludes data from Cambridge – for Condor support in APEL see Dave Kant’s talk
Storage has seen a healthy increase – but usage remains low
At the GridPP Project Management and Deployment Boards yesterday we discussed ways to encourage the experiments to make more use of Tier-2 disk space – the Tier-1 will be unable to meet allocation requests. One of the underlying concerns is what data flags mean to Tier-2 sites.
Scheduled downtime
Views of the data will be available from the CIC portal from today! http://cic.in2p3.fr/
Scheduled downtime
Congratulations to Lancaster for being the only site to have no Scheduled Downtime
SFT review
It was probably clear already that the majority of our failures (and those of other large ROCs) are lcg-rm (failure points: the replica catalogue, the configured BDII, CERN storage for replication, OR a local SE problem) and rgma (generally a badly configured site). We know the tests also need to improve and become more reliable and accurate!
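For reference, a minimal sketch of the kind of replica-management sequence the lcg-rm test exercises is given below. It drives the lcg_utils command-line tools from Python; the VO, storage elements and file name are hypothetical placeholders, and the real SFT performs additional checks.

```python
# Minimal sketch of an lcg-rm-style replica management check (not the real SFT).
# Assumes a valid grid proxy and the lcg_utils CLI on the PATH; the VO and SE
# names below are placeholders - substitute real values for your site.
import subprocess

VO = "dteam"                      # hypothetical test VO
LOCAL_SE = "se.example.ac.uk"     # hypothetical local storage element
REMOTE_SE = "remote-se.example.org"

def run(cmd):
    """Run a command, echo it, and return its stdout (raises on failure)."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

# 1. Copy a local file to the local SE and register it in the catalogue.
guid = run(["lcg-cr", "--vo", VO, "-d", LOCAL_SE, "file:///tmp/sft-test-file"])

# 2. Replicate it to a second SE (this is where BDII/catalogue/SE problems show up).
run(["lcg-rep", "--vo", VO, "-d", REMOTE_SE, guid])

# 3. List the replicas, then clean up all copies.
print(run(["lcg-lr", "--vo", VO, guid]))
run(["lcg-del", "-a", "--vo", VO, guid])
```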
Overall EGEE statistics
The same problems cover the majority of EGEE resources. Hours of impact will be
available soon and will help us evaluate the true significance of these results.
Completing the weekly reports
ALL site administrators have now been asked to complete information related to problems observed at their sites, as recorded in the weekly operations report.
This will impact our message at weekly EGEE operations reviews (http://agenda.cern.ch/displayLevel.php?fid=258) and YOUR Tier-2 performance figures!
Performance measures
• The GridPP Oversight Committee has asked us to investigate why some sites
perform better than others. As well as looking at the SFT and ticket response data,
the Imperial College group will help pull data from their LCG2 Real Time Monitor
Daily Summary Reports: http://gridportal.hep.ph.ic.ac.uk/rtm/reports.html
EGEE metrics
• While we have defined GridPP metrics, many are not automatically produced. EGEE now has metrics as a priority, and at EGEE3 a number of metrics were agreed for the project and assigned.
SIZE
• # of sites in production
• # of job slots
• Total available kSpecInt
• Storage (disc)
• Mass storage
• # EGAP approved VOs
• # active VOs
• # active users
• Total % used resources
DEPLOYMENT
• Speed of m/w security
update
OPERATIONS
• Site responsiveness to COD
• Site response to tickets
• Site tests failed
• % availability of SE, CE
• # days downtime per ROC
SUPPORT
• User ticket response time
• Number of “supporters”
• # tickets escalated
• % tickets wrongly assigned
• ROC measures
USAGE
• Jobs per VO (submit, comp, fail)
• Data transfer per VO
• CPU and storage usage per VO
• % sites blacklisted/whitelisted
• # CE/SE available to VO
SERVICE NODES (testing)
• RB – submit to CE time
• BDII – query time average (see the sketch after this list)
• MyProxy – register/access/del
• SRM-SE – test file movement
• Catalogue test
• VOMS
• RGMA
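As an illustration of how a "BDII – query time average" figure could be produced, the sketch below simply times a few LDAP queries against a BDII endpoint. The host name is a placeholder, and the exact metric definition agreed at EGEE3 may differ.

```python
# Rough sketch of a BDII query-time probe (placeholder host; the agreed EGEE
# metric may be defined differently). BDIIs publish over LDAP on port 2170.
import subprocess
import time

BDII = "ldap://lcg-bdii.example.org:2170"   # hypothetical BDII endpoint
BASE = "o=grid"                             # standard information-system base

def timed_query():
    start = time.time()
    subprocess.run(
        ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", BASE,
         "(objectClass=GlueCE)", "GlueCEUniqueID"],
        check=True, capture_output=True)
    return time.time() - start

samples = [timed_query() for _ in range(5)]
print(f"BDII query time average: {sum(samples) / len(samples):.2f} s over {len(samples)} queries")
```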
The status of 2.7.0
• Mon 9th Jan - tag and begin local testing of installations and
upgrades on mini testbeds, complete documentation
• Mon 16th Jan - pre-release to >3 ROCs for a week of further testing
• Mon 23rd Jan - incorporate results of ROC testing and release asap
• Release 2.7.0 (at the end of January!?)
Expect
• Bug fixes – RB, BDII, CE, SE-classic, R-GMA, GFAL, LFC, SE_DPM
• VOMS – new server client version
• VO-BOX – various functionality changes
• LFC/DPM updates
• Lcg_utils/GFAL – new version & bug fixes
• RB – new functionality for job status checking
• Security issues – pool account recycling, signed rpm distribution
• FTS clients & dCache 1.6.6.2
• Some “VO management via YAIM” additions
Details of release work: https://uimon.cern.ch/twiki/bin/view/LCG/LCG-2_7_0
Outcomes of security challenge
Comments (Thanks to Alessandra Forti)
• Test suites should be asynchronous
• The security contacts mail list is not up to date
• 4 sites' CSIRTs did not pass on information – site security contacts should be the administrators and not the site CSIRTs
• 1 site did not understand what to do
• The majority of sites acknowledged tickets within a few hours once the site administrator received the ticket
• On average, sites responded with CE data in less than 2 days (some admins were unsure about contacting the RB staff)
• 2 sites do not use the lcgpbs jobmanager and were unable to find the information in the log files (also 1 using Condor)
• Some sites received more than one SSC job in the 3 hr timeframe and were unable to return an exact answer, so gave several
• Mistake in the date – admins spotted the inconsistencies
• The ROC struggled with ticket management, which caused delays in processing tickets!
Aside: The EGEE proposed Security Incident handbook is being reviewed by the
deployment team: http://wiki.gridpp.ac.uk/wiki/Incident_Response_Handbook
Other areas of interest!
• The Footprints version (the UKI ROC ticketing system) will be upgraded on 23rd January. This will improve our interoperation with GGUS and other ROCs (using XML emails). There should be little observable impact, but we do ask: PLEASE SOLVE & CLOSE as many currently open tickets as possible by 23rd January.
• Culham (the place which hosted the last operations workshop) will be adding a
new UKI site in the near future. They will join or host the Fusion VO.
• Most sites have now completed the “10 Easy Network Questions” responses.
http://wiki.gridpp.ac.uk/wiki/GridPP_Answers_to_10_Easy_Network_Questions
This has proved a useful exercise. What do you think?
• The deployment team has identified a number of operational areas to improve.
These include such things as experiment software installation, VO support, and the availability of certain information on processes (like where to start for new sites).
• Pre-production service: UKI now has 3 sites with gLite (components) either
deployed or in the process of being deployed
REMINDER & REQUEST – Please enable more VOs! The GridPP PMB requests that 0.5% (1% in EGEE-2) of resources be used to support wider VOs – like BioMed. This will also get our utilisation higher. Feedback is going to developers on making adding VOs easier.
Our focus is now on
Service Challenge 4
A number of milestones (previously discussed at the 15th November UKI Monthly Operations Meeting) have been set. Red text means a milestone is at risk (generally due to external dependencies) and green text signifies done.
SRM
• 80% of sites have working (file transfers with 2 other sites successful) SRM by end
of December
• All sites have working SRM by end of January
• 40% of sites (using FTS) able to transfer files using an SRM 2.1 API by end February
• All sites (using FTS) able to transfer files using an SRM 2.1 API by end March
• Interoperability tests between SRM versions at Tier-1 and Tier-2s (TBC)
FTS Channels
• FTS channel to be created for all T1-T2 connections by end of January
• FTS client configured for 40% sites by end January (see the sketch below)
• FTS channels created for one Intra-Tier-2 test for each Tier-2 by end of January
• FTS client configured for all sites by end March
GridPP links, progress and status is being logged in the GridPP wiki: http://wiki.gridpp.ac.uk/wiki/Service_Challenges
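As referenced in the FTS client milestone above, a minimal sketch of what a configured FTS client lets a site do is shown below. It wraps the glite-transfer-* commands from Python; the FTS service endpoint and SURLs are hypothetical placeholders, and the options and status strings should be checked against the FTS client version actually installed.

```python
# Minimal sketch of submitting and polling a transfer with the FTS client
# (glite-transfer-* commands). Endpoint and SURLs are hypothetical placeholders.
import subprocess
import time

FTS = "https://fts.example.ac.uk:8443/glite-data-transfer-fts/services/FileTransfer"
SRC = "srm://ral-se.example.ac.uk/dpm/example.ac.uk/home/dteam/1GB-test-file"
DST = "srm://t2-se.example.ac.uk/dpm/example.ac.uk/home/dteam/1GB-test-file"

# Submit the transfer job and capture the returned job identifier.
job_id = subprocess.run(
    ["glite-transfer-submit", "-s", FTS, SRC, DST],
    check=True, capture_output=True, text=True).stdout.strip()
print("Submitted FTS job:", job_id)

# Poll the job status until it reaches a terminal state (names may vary by version).
while True:
    status = subprocess.run(
        ["glite-transfer-status", "-s", FTS, job_id],
        check=True, capture_output=True, text=True).stdout.strip()
    print("Status:", status)
    if status in ("Done", "Finished", "FinishedDirty", "Failed", "Canceled"):
        break
    time.sleep(30)
```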
Core to these are the…
Data Transfers
Tier-1 to Tier-2 Transfers (Target rate 300-500Mb/s; see the worked example after this list)
•Sustained transfer of 1TB data to 20% sites by end December
•Sustained transfer of 1TB data from 20% sites by end December
•Sustained transfer of 1TB data to 50% sites by end January
•Sustained transfer of 1TB data from 50% sites by end January
•Sustained individual transfers (>1TB continuous) to all sites completed by mid-March
•Sustained individual transfers (>1TB continuous) from all sites by mid-March
•Peak rate tests undertaken for all sites by end March
•Aggregate Tier-2 to Tier-1 tests completed at target rate (rate TBC) by end March
Inter Tier-2 Transfers (Target rate 100 Mb/s)
•Sustained transfer of 1TB data between largest site in each Tier-2 to that of another
Tier-2 by end February
•Peak rate tests undertaken for 50% sites in each Tier-2 by end February
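To put the target rates into perspective, the short calculation below shows roughly how long a sustained 1 TB transfer takes at the Tier-1 to Tier-2 (300-500 Mb/s) and inter-Tier-2 (100 Mb/s) targets.

```python
# How long does a sustained 1 TB transfer take at the SC4 target rates?
TERABYTE_BITS = 8e12   # 1 TB (decimal) expressed in bits

for rate_mbps in (100, 300, 500):
    seconds = TERABYTE_BITS / (rate_mbps * 1e6)
    print(f"{rate_mbps} Mb/s sustained -> 1 TB in {seconds / 3600:.1f} hours")

# 100 Mb/s -> ~22.2 hours, 300 Mb/s -> ~7.4 hours, 500 Mb/s -> ~4.4 hours,
# so the "sustained" milestones imply keeping a transfer running without
# interruption for several hours up to around a day.
```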
The current status
[Table: measured transfer rates between the RAL Tier-1 and the Tier-2 sites Lancaster, Manchester, Edinburgh, Glasgow, IC-HEP and RAL-PPD (both receiving and sending); rates so far range from 0 Mb/s up to ~800 Mb/s – see the wiki page below for the full matrix]
NEXT SITES:
London – RHUL & QMUL
ScotGrid – Durham
SouthGrid – Birmingham & Oxford?
NorthGrid – Sheffield? & Liverpool
KEY:
Black figures indicate 1TB transfer
Blue figures indicate <1TB transfer (eg. 10 GB)
http://wiki.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Tests
Additional milestones
LCG File Catalog
• LFC document available by end November
• LFC installed at 1 site in each Tier-2 by end December
• LFC installed at 50% sites by end January
• LFC installed at all sites by end February
• Database update tests (TBC)
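For the LFC installation milestones above, a minimal sketch of how a site might verify that a freshly installed LFC is answering is given below. The host name and VO path are hypothetical placeholders, and the standard LFC client tools are assumed to be installed.

```python
# Minimal check that a newly installed LFC responds: list (and, if needed,
# create) a directory for the VO. Host and VO are hypothetical placeholders;
# requires a valid grid proxy and the LFC client tools (lfc-ls, lfc-mkdir).
import os
import subprocess

os.environ["LFC_HOST"] = "lfc.example.ac.uk"   # hypothetical LFC host
vo_dir = "/grid/dteam"                          # conventional per-VO namespace

listing = subprocess.run(["lfc-ls", "-l", vo_dir],
                         capture_output=True, text=True)
if listing.returncode == 0:
    print(listing.stdout)
else:
    # The directory may simply not exist yet on a brand-new catalogue.
    subprocess.run(["lfc-mkdir", "-p", vo_dir], check=True)
    print(f"Created {vo_dir} on {os.environ['LFC_HOST']}")
```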
VO Boxes
•Depending on experiment responses to security and operations questionnaire
and GridPP position on VO Boxes.
•VOBs available (for agreed VOs only) for 1 site in each Tier-2 by mid-January
•VOBs available for 50% sites by mid-February
•VOBs available for all (participating) sites by end March
Experiment Specific Tests (TBC)
•To be developed in conjunction with experiment plans – Please make
suggestions!
VO Box status notes:
• LHCb & ALICE questionnaires received. Accepted, and VO boxes deployed at the Tier-1. Little use so far – ALICE has not had a disk allocation.
• The original ATLAS response was not accepted. They have since tried to implement VO boxes, found problems, and are now looking at a centralised model.
• CMS do not have VO Boxes but they DO require local VO persistent processes.
Getting informed & involved!
The deployment team are working to make sure sites have sufficient information. Coordinate your activities with your Tier-2 Coordinator.
1)Stay up to date via the Storage Group work: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
2) General Tier-1 support: http://wiki.gridpp.ac.uk/wiki/RAL_Tier1
3) Understand and setup FTS (channels):
http://wiki.gridpp.ac.uk/wiki/RAL_Tier1_File_Transfer_Service
4) VO Boxes go via Tier-1 first: http://wiki.gridpp.ac.uk/wiki/VOBox
5) Catalogues (& data management): http://wiki.gridpp.ac.uk/wiki/Data_Management
The status of sites is being tracked here:
http://wiki.gridpp.ac.uk/wiki/Service_Challenge_4_Site_Status
Some particular references worth checking out when taking the next step:
6) What RALPP did to get involved:
http://wiki.gridpp.ac.uk/wiki/RALPP_Local_SC4_Preparations
8) Edinburgh dCache tests: http://wiki.gridpp.ac.uk/wiki/Ed_SC4_Dcache_Tests
9) Glasgow DPM testing: http://wiki.gridpp.ac.uk/wiki/Glasgow
PLEASE CREATE SITE TESTING LOGS – it helps with debugging and information sharing
Summary
1 Metrics show stability and areas where we can improve
2 EGEE work will add to information which is published & analysed
3 GridPP & experiments need to work at better use of Tier-2 disk
4 There are changes coming with 2.7.0 & helpdesk upgrade
5 Focus has shifted to Service Challenge work (including security)
6 Sites asked to complete reports, reduce tickets & get involved in SC4!