GridPP Overview Tony Doyle 4 July 2005 GridPP13 Collaboration Meeting Tony Doyle - University of Glasgow.

GridPP
Overview
Tony Doyle
4 July 2005
GridPP13 Collaboration Meeting
Tony Doyle - University of Glasgow
Contents
• Technical Design Reports
• Timescales
• Oversight Committee Summary
– Current concerns
– Actions (and how these were addressed)
– Feedback from the July 1 (OC7) meeting
• “Get Fit” Plan and Problem Solving
• Beyond GridPP2..
June Reports
Computing Technical Design Reports:
http://doc.cern.ch/archive/electronic/cern/preprints/lhcc/public/
ALICE: lhcc-2005-018.pdf
ATLAS: lhcc-2005-022.pdf
CMS: lhcc-2005-023.pdf
LHCb: lhcc-2005-019.pdf
LCG: lhcc-2005-024.pdf
LCG Baseline Services Group Report:
http://cern.ch/LCG/peb/bs/BSReport-v1.0.pdf
Timescales
• Service Challenges – UK deployment plans
Functionality
Requirement
OMII
VDT/GT
Storage Element
Basic File Transfer
Yes
GridFTP
Reliable File Transfer
Catalogue Services
Data Management tools
Compute Element
Workload Management
RLS
OMII Data Service (upload / download)
LCG/gLite
Other
Yes
SRM via dCache, DPM
or CASTOR
Yes
File Transfer Service
FTS is built on top of GridFTP
LCG File Catalogue,
gLite FireMan
Central catalogues adequate, high throughput
needed
LCG tools
(replica management,
etc.)
OMII Job Service
Gatekeeper
Yes
Manual resource allocation & job
submission
Condor-G
Resource Broker
gLite File Placement Service under development
LCG uses Globus with mods
RB builds on Globus, Condor-G
Perform localised activities on behalf of VO
Tools for account management, no
GridMapFile equivalent
CAS
VOMS
DataBase Services
CAS does not provide all the needed functionality
MySQL, PostgreSQL,
ORACLE
Posix-like I/O
GFAL, gLite I/O
Application Software
Installation Tools
Job Monitoring
Yes
Monalisa, Netlogger
Off-the-shelf offerings are adequate
Xrootd
Tools already exist in LCG-2
e.g. PACMAN
Logging &
Bookkeeping
service, RGMA
Reliable Messaging
Information System
LCG includes Storage Resource Management
capability
LCG includes GridFTP
VO Agents
VO Membership Services
Comment
Tools such as Jabber are used by experiments
(e.g. DIRAC for LHCb)
MDS
(GLUE)
Yes
BDII
LCG based on BDII and GLUE schema
July Documents
PPARC Oversight Committee Papers
Seventh GridPP Oversight Committee (July 2005)
Executive Summary
Project Map
Link to Project Map Database (Excel) Version (v2)
Resource Report
LCG Report
EGEE Report
Deployment Report
Middleware/Security/Network Report
Applications Report
User Board Report
Tier-1/A Report
Tier-2 Report
Dissemination Report
UK Analysis
Metrics and Deployment
Middleware Planning
Experiment engagement questionnaire
See http://www.gridpp.ac.uk/docs/oversight/
Exec2 Summary
• GridPP2 has already met 21% of its original targets with
86% of the metrics within specification
• “Get fit” plan described (requested by OC)
• gLite 1 was released in April as planned but components
have not yet been deployed or their robustness tested by
the experiments
• Service Challenge (SC) 2 addressing networking was a
success at CERN and the Tier-1
• SC3 addressing file transfers for the experiments is about
to commence
• Long-term concern: hardware at the Tier-1 in 2007-08
• Short-term concerns: under-utilisation of resources and
the deployment of Tier-2 resources
RAL joins labs worldwide in
successful Service Challenge 2
• The GridPP team at Rutherford Appleton
Laboratory (RAL) in Oxfordshire recently
joined computing centres around the world
in a networking challenge that saw RAL
transfer 60 terabytes of data over a ten-day period. A home user with a 512 kilobit
per second broadband connection would be
waiting 30 years to complete a download of
the same size.
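The broadband comparison above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 terabyte = 10^12 bytes, 1 kilobit = 1000 bits) and a 365.25-day year; those conventions and all names in the code are assumptions for illustration, not taken from the news item.

```python
# Assumed conventions (not stated in the source): decimal units, 365.25-day year.
BITS_PER_TERABYTE = 8 * 10**12
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def transfer_seconds(terabytes: float, bits_per_second: float) -> float:
    """Time needed to move `terabytes` of data at a constant line rate."""
    return terabytes * BITS_PER_TERABYTE / bits_per_second


def average_rate_mbit(terabytes: float, days: float) -> float:
    """Average rate in Mbit/s needed to move `terabytes` within `days` days."""
    return terabytes * BITS_PER_TERABYTE / (days * SECONDS_PER_DAY) / 1e6


# 60 TB over a 512 kbit/s home connection, expressed in years.
home_years = transfer_seconds(60, 512_000) / SECONDS_PER_YEAR
# Average rate RAL would have sustained to move 60 TB in ten days.
ral_rate = average_rate_mbit(60, 10)

print(f"Home broadband: about {home_years:.1f} years")
print(f"RAL average over ten days: about {ral_rate:.0f} Mbit/s")
```

Under these assumptions the home connection needs just under 30 years, consistent with the figure quoted, and the ten-day transfer corresponds to an average of a little over 550 Mbit/s.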
gLite 1
100 green sites
sitting on a grid
• Thu 16 Jun 2005
• Last week the UK CIC-on-duty team
celebrated the milestone of having 100
sites passing the Sites Functional Test.
Thanks to all the sites who acted promptly
to trouble tickets raised by the UK team
during their shift.
Current concern
1. under-utilisation
[Chart legend: Non-Grid, Grid]
• Under-utilisation of existing Tier-1/A resources
• improving overall and w.r.t. Grid fraction from 2004 to 2005
Current concern
2. under-delivery

              CPU KSI2K                     Disk TB
Site          Promised  Delivered  Ratio    Promised  Delivered  Ratio
Brunel              30          4    13%          21          0     0%
Imperial           420        384    91%          28          0     1%
QMUL               317        247    78%          29         25    88%
RHUL               204        167    82%          23          6    24%
UCL                 60        108   180%           1          1   111%
Lancaster          510        101    20%          87          2     2%
Liverpool          605          0     0%          80          0     0%
Manchester        1305         65     5%         373          9     2%
Sheffield          183         29    16%           3          3   100%
Birmingham         136        103    76%           9          9    97%
Bristol             39         38    98%           2          2    99%
Cambridge           33         12    38%           4          4    80%
Oxford             119        102    85%          19         19   100%
RAL PPD             98         99   101%          12          6    51%
Warwick              0          0                  0          0
Durham              86         15    17%           5          5   100%
Edinburgh            7          5    74%          71          1     1%
Glasgow            246          1     1%          15          3    17%
London            1031        910    88%         102         32    31%
NorthGrid         2602        195     7%         543         14     3%
ScotGrid           340         22     6%          90          9     9%
SouthGrid          425        354    83%          46         39    85%
Total             4397       1481    34%         781         93    12%

• The current situation is somewhat better than these 2005 Q1 numbers indicate
• Some late procurements (OK given under-utilisation)
• Technical problems (being overcome)
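The delivery ratios on this slide are delivered capacity divided by promised capacity. The sketch below recomputes them for the regional Tier-2 CPU totals quoted above; the function and variable names are hypothetical, and returning `None` for a site that promised nothing (as for Warwick) is an assumption about how an empty ratio should be handled.

```python
from typing import Optional

# Regional CPU totals in KSI2K as (promised, delivered), copied from the slide.
CPU_KSI2K = {
    "London": (1031, 910),
    "NorthGrid": (2602, 195),
    "ScotGrid": (340, 22),
    "SouthGrid": (425, 354),
}


def delivery_ratio(promised: float, delivered: float) -> Optional[float]:
    """Delivered capacity as a percentage of promised capacity.

    Returns None when nothing was promised, so no ratio is defined.
    """
    if promised == 0:
        return None
    return 100.0 * delivered / promised


ratios = {region: delivery_ratio(p, d) for region, (p, d) in CPU_KSI2K.items()}
total_promised = sum(p for p, _ in CPU_KSI2K.values())
total_delivered = sum(d for _, d in CPU_KSI2K.values())
total_ratio = delivery_ratio(total_promised, total_delivered)

for region, ratio in ratios.items():
    print(f"{region}: {ratio:.0f}%")
print(f"Total: {total_ratio:.0f}%")
```

The regional promised figures sum to 4398 rather than the 4397 shown on the slide, presumably because the slide values are rounded; the overall ratio is unaffected at this precision.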
Longer-Term
concern: allocations

Alloc.
Year  Resource       ALICE  ATLAS   CMS  LHCb  TOTAL  LHC TOTAL  LHC Fraction
2004  Disk (TB)          5     27    40    15    191         87           46%
2004  CPU (KSI2K)       14    400    86    90    796        590           74%
2004  Tape (TB)          4      0    50    15    239         69           29%
2005  Disk (TB)          1     68    80    25    298        174           58%
2005  CPU (KSI2K)        1    400   200    50   1282        651           51%
2005  Tape (TB)          1     14   206    30    331        250           76%
2006  Disk (TB)         10    257    74   108    604        450           74%
2006  CPU (KSI2K)       24    529   205   222   1604        980           61%
2006  Tape (TB)         10    150   400   104    891        664           74%
2007  Disk (TB)         13    508   128   191   1100        841           76%
2007  CPU (KSI2K)       24    801   283   384   2167       1492           69%
2007  Tape (TB)         13    377   483   207   1280       1080           84%
2008  Disk (TB)         26    887   227   343   1945       1484           76%
2008  CPU (KSI2K)       48   1571   449   644   3891       2712           70%
2008  Tape (TB)         26   1033   670   346   2316       2074           90%
2009  Disk (TB)         46   1249   343   450   2633       2087           79%
2009  CPU (KSI2K)       84   2593   661   868   5516       4206           76%
2009  Tape (TB)         46   2026  1148   714   4130       3934           95%
2010  Disk (TB)         80   1892   503   545   3641       3020           83%
2010  CPU (KSI2K)      147   3504   916  1290   7358       5857           80%
2010  Tape (TB)         80   2790  1663  1178   5944       5710           96%

Starting point: fair shares input to BaBar and LHC MoUs
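The "LHC Fraction" row of the allocations table is the LHC total divided by the overall total for each resource. A minimal sketch of that calculation for the first and last years, using figures copied from the table; the data layout and names are hypothetical.

```python
# (LHC total, overall total) per resource, copied from the allocations table.
ALLOCATIONS = {
    2004: {"Disk": (87, 191), "CPU": (590, 796), "Tape": (69, 239)},
    2010: {"Disk": (3020, 3641), "CPU": (5857, 7358), "Tape": (5710, 5944)},
}


def lhc_fraction(lhc_total: float, overall_total: float) -> float:
    """LHC share of the overall allocation, as a percentage."""
    return 100.0 * lhc_total / overall_total


fractions = {
    year: {resource: round(lhc_fraction(*pair)) for resource, pair in rows.items()}
    for year, rows in ALLOCATIONS.items()
}

for year, row in fractions.items():
    print(year, row)
```

The rounded results reproduce the table's fractions for both years, which shows the LHC share of the Tier-1 allocation growing for every resource between 2004 and 2010.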
Metrics and
Deployment
• GridPP is a significant contributor to EGEE (20%)
[Chart: Percentage contribution (UK % total CPU) vs. Date, June 2004 to June 2005]
• CPU utilisation is low
[Chart: % job slots used (% EGEE slots used, % UK slots used) vs. Date, June 2004 to June 2005]
• Disk utilisation is climbing (but very low)
[Chart: TB disk storage used (UK storage used) vs. Date, June 2004 to June 2005]
Metrics and
Deployment
• Reflects systematic approach and measurable improvements in deployment
• gradual improvement in site configuration and stability
[Chart: gstat metric value (EGEE, GridPP) vs. Date, January to June 2005]
• Sites upgrade improvements – quarterly upgrades within 3 weeks
[Chart: Sites at release (LCG-2_3_0, LCG-2_3_1, LCG-2_4_0) vs. Date, January to June 2005]
GridPP Deployment Status
2/7/05 (9/1/05)

Total         2/7/05   (9/1/05)
totalCPU        2966     (2029)
freeCPU         1666     (1402)
runJob           843       (95)
waitJob           31      (480)
seAvail TB     74.28     (8.69)
seUsed TB      16.54     (4.55)
maxCPU          3145     (2549)
avgCPU          2802     (1994)
Actions
GridPP to submit the proposal for LCG phase 2 funding to the Committee prior to its
submission to Science Committee (minute 4.9).
• Done. 27 page report inc. input from OC at
http://www.gridpp.ac.uk/docs/gridpp2/SC_GridPP2_LCG_1.0.doc unfunded
GridPP to clarify the situation with regard to ATLAS production run tests for the next
physics workshop (minute 5.3).
• See News Item http://www.gridpp.ac.uk/news/-1119651840.463358.wlg
• (and slide)
GridPP to provide an update on progress resolving problems caused by mismatches
between local batch systems and the capabilities of the grid Resource broker
(minute 6.3).
• (See slide)
GridPP to more fully document its alignment with each of the individual experiments
(minute 15.2).
• An experiment engagement questionnaire has been used (initial input in
February and further [updated] input in June). See
http://www.gridpp.ac.uk/eb/workdoc/gridusebyexpts_0605.doc
ATLAS steps up Grid production
In a large-scale exercise in the weeks leading up to the
workshop, about 8.5 million Monte Carlo simulated events
were produced on the Grid. The events were produced
using three Grids: LCG, Grid3 in the US and NorduGrid.
Of the 65% produced on LCG, approximately one sixth
used GridPP resources in the UK. A total of 573,315 jobs
were run. On the best day 13k jobs ran, corresponding to a
production rate of 7.5Hz.
In parallel with all the major computing developments on
the Grid, first real cosmic data has now been taken with
the ATLAS detector in situ - see image. This is the first
trickle in what will eventually become a torrent of data.
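The production shares quoted above imply an approximate event count for GridPP. The sketch below works through that arithmetic; it treats "about 8.5 million", "65%" and "approximately one sixth" as exact inputs, which is an assumption, so the result is only indicative. All names are hypothetical.

```python
# Inputs taken from the news item and treated as exact for illustration only.
TOTAL_EVENTS = 8.5e6          # Monte Carlo events produced on all three Grids
LCG_SHARE = 0.65              # fraction of events produced on LCG
GRIDPP_SHARE_OF_LCG = 1 / 6   # fraction of the LCG production run on GridPP

lcg_events = TOTAL_EVENTS * LCG_SHARE
gridpp_events = lcg_events * GRIDPP_SHARE_OF_LCG
gridpp_share_overall = LCG_SHARE * GRIDPP_SHARE_OF_LCG

print(f"Events on LCG:    about {lcg_events / 1e6:.2f} million")
print(f"Events on GridPP: about {gridpp_events / 1e6:.2f} million")
print(f"GridPP share of the whole exercise: about {gridpp_share_overall:.1%}")
```

On these assumptions GridPP resources produced roughly 0.9 million events, a little over a tenth of the whole exercise.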
RB Action
GridPP to provide an update on progress resolving problems caused by
mismatches between local batch systems and the capabilities of the
grid Resource broker (minute 6.3).
• The problem of connecting the local CE to a batch queue is largely
overcome – many (all shared) sites now do this.
• There were problems subsequently deploying the accounting system
(APEL) to point to the local batch system.
• Overcome (at 13 of 18 sites), but not as straightforward as it could be.
• The JDL from the job is not passed to the local system. Hence there
is no way for the local scheduler to use info from the Grid scheduler.
• This is a limitation from a (shared) site viewpoint (attempting to
balance Grid and local jobs).
• The short term solution is to set up separate batch queues.
• It is not a limitation for the experiments (affects efficiency).
• It is noted as a requirement and it is intended that this will be
delivered in Year 2 of JRA1 for the WMS.
Actions
GridPP to define its usage policy with respect to Tier-1 allocations (minute 15.4).
• See http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-57-Tier1A_1.0.doc
and documents within (“fair shares” using PPARC Form X information)
GridPP to produce an updated risk register (minute 15.5).
• Incorporated in the new Project Map at (with 7 “high” risks)
http://www.gridpp.ac.uk/pmb/ProjectManagement/GridPP2_ProjectMap_2.htm
GridPP to produce a “get-fit” plan for production metrics (minute 15.6).
• See Metrics and Deployment document
http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-64-Metrics.doc
and its incorporation into the Project Map
GridPP to define its metrics for job success (minute 15.7).
• Adopted EGEE-wide definition at
http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php
(See slides)
GridPP to produce a statement of intent regarding its adoption of gLite (minute 15.8).
• See Middleware Selection document
http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-65-Middleware.doc
Metrics Action
GridPP to define its metrics for job success (minute 15.7).
• GridPP adopts the EGEE-wide definition at
http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php
The (web-based) QA system accounts for Workload Management System registered job
successes (that can then be categorised by Virtual Organisation or Resource Broker)
Before introducing the figures it should be understood that there are caveats:
• It only measures what the WMS “sees”
– doesn't catch failure of WMS to register job in the first place (but this is a rare occurrence)
– if a job half way through the script fails (for example tries but fails to copy a file) but the
script completes successfully then WMS sees everything as OK.
– If a VO (e.g. LHCb) deploys an agent then the WMS only registers the success of the initial
(python) script: strategy enables higher overall LHCb performance (combined push-PULL
model). (This currently leads to other problems in overall accounting should contention
become an issue).
Overall: an end user may see either:
1. a worse efficiency
• failed job for other hidden e.g. data management problems
2. a better efficiency by
• choosing selected sites according to the Site Functional Test performance index;
• deploying an agent to initiate real jobs at sites where the agent succeeded.
Physicists are “smart” and now “see” > 90% efficiency but the definition here is one
defined within a given VO adopting their own methods (and from informed input from
people currently submitting jobs to the system).
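The caveats above amount to a difference between the success rate the WMS can report and the one an end user experiences. The sketch below illustrates that gap with invented job records; the `JobRecord` fields, both functions and the sample data are hypothetical and are not part of the EGEE QA system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobRecord:
    registered: bool      # did the WMS register the job at all?
    script_exit_ok: bool  # did the submitted script exit successfully?
    payload_ok: bool      # did the real work (e.g. a file copy) succeed?


def wms_success_rate(jobs: List[JobRecord]) -> float:
    """Success rate as the WMS sees it: registered jobs whose script exited OK."""
    seen = [job for job in jobs if job.registered]
    return sum(job.script_exit_ok for job in seen) / len(seen)


def user_success_rate(jobs: List[JobRecord]) -> float:
    """Success rate as the submitter sees it: every step must have worked."""
    good = sum(j.registered and j.script_exit_ok and j.payload_ok for j in jobs)
    return good / len(jobs)


# Invented sample: 1 job never registered, 2 visible failures, 1 hidden
# failure (script exits OK although a file copy failed), 6 real successes.
jobs = (
    [JobRecord(False, False, False)]
    + [JobRecord(True, False, False)] * 2
    + [JobRecord(True, True, False)]
    + [JobRecord(True, True, True)] * 6
)

print(f"WMS view:  {wms_success_rate(jobs):.0%}")
print(f"User view: {user_success_rate(jobs):.0%}")
```

In this invented sample the WMS figure is higher than the user's because unregistered jobs and hidden payload failures are invisible to it, which corresponds to the "worse efficiency" case described on the slide.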
Overview
Integrated over all VOs and RBs for first half of 2005

Successes/Day   13806
Success %       64%

• Key point: Improving from 42% to 78% during 2005

[For the UK RB (lcgrb01.gridpp.rl.ac.uk)
Successes/Day   319
Success %       69% ]
LHC VOs
                ALICE   ATLAS   CMS     LHCb
Successes/Day   N/A     2796    452     3463
Success %       42%     83%     61%     68%
Other VOs
                BaBar   CDF     D0      BioMed
Successes/Day   37      1       207     1074
Success %       76%     30%     84%     76%
Interlude..
Angels & Demons introduces the character of Robert
Langdon, professor of religious iconology and art
history at Harvard University. As the novel begins, he's
awakened in the middle of the night by a phone call
from Maximilian Kohler, the director of CERN, the
world's largest scientific research facility in Geneva,
Switzerland. One of their top physicists, Leonardo Vetra,
had been murdered, with his chest branded with the
word "Illuminati."
Leonardo Vetra created antimatter in canisters to
simulate the Big Bang. Vetra's murder, though, allows
one of the canisters to be stolen. Langdon and Vittoria
Vetra are quickly sent off to Rome and Vatican City, to
help find the canister and return it to CERN before it
explodes at midnight...
Agents and Daemons
The jobs that are sent to the LCG-2 Resource Broker (RB)
do not contain any particular LHCb job as payload, but are
only executing a simple script, which downloads and
installs a standard DIRAC agent. Since the only
environment necessary for the agent to run is the Python
interpreter, this is perfectly possible on all the LCG sites.
This pilot-agent is configured to use the hosting Worker
Node (WN) as a DIRAC CE. Once this is done, the WN is
reserved for the DIRAC WMS and is effectively turned
into a virtual DIRAC production site for the time of
reservation.
This way allowed for efficient use of the LCG resources
during the DC 2004 (over 5000 concurrent jobs at peak)
with a low effective failure rate, despite the rather high
intrinsic failure rate of LCG (about 40%).
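The pilot-agent approach described above can be illustrated with a toy comparison of a push model and a pull model. This is a simplified sketch, not DIRAC code: the function names, the list of worker-node outcomes and the job identifiers are all invented for illustration.

```python
from collections import deque
from typing import List


def run_push(jobs: List[int], slot_ok: List[bool]) -> List[int]:
    """Push model: each job is bound to a slot up front; a bad slot loses its job."""
    return [job for job, ok in zip(jobs, slot_ok) if ok]


def run_pull(jobs: List[int], slot_ok: List[bool]) -> List[int]:
    """Pull model: a pilot starts on every slot and only a healthy pilot
    fetches a real job from the central queue."""
    queue = deque(jobs)
    done = []
    for ok in slot_ok:
        if not ok:
            continue  # the pilot failed, but no real job was assigned to it
        if queue:
            done.append(queue.popleft())
    return done


# Ten worker-node slots with a 40% failure rate (invented pattern), six real jobs.
slot_ok = [True, False, True, True, False, True, False, True, True, False]
jobs = list(range(6))

print("push:", run_push(jobs, slot_ok))
print("pull:", run_pull(jobs, slot_ok))
```

In this toy case the push model loses the two jobs that were bound to failing slots, while the pull model completes all six because real jobs are only handed to slots where the pilot is already running. The cost is that pilots are also started on slots that turn out to be unusable.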
The future for the
experiments?
The technologies used in this production are based on C++
(LHCb software), Python (DIRAC tools), Jabber/XMPP
(instant messaging protocol used for reliable
communication between components of the central
services) and XML-RPC (the protocol used to
communicate between jobs and central services). ORACLE
and MySQL are the two databases behind all of the
services. ORACLE was used for the production and
bookkeeping databases, and MySQL for the workload
management and AliEn FC systems.
OC Preliminary
Feedback
ALL earlier actions were considered as “done” from OC perspective
GridPP to investigate alternative procurement strategies in
order to improve Tier-1/A utilisation
Actions:
I. Tier-1/A Board: evaluate alternative approaches
II. User Board – THIS MEETING: improve experiment estimates
GridPP to associate more resources for technical documentation
(for end users and system administrators)
Actions:
• Internal advertising: is anyone within GridPP willing/able to take up the role
of “Documentation Officer”?
• (There will be an incentive for this)
• If this fails, to advertise the post using role description (being drafted)
Deployment Board – THIS MEETING
OC Preliminary
Feedback
• GridPP to develop a deployment model that works for
smaller T2 centres in association with CERN
• GridPP to provide a gap analysis for LCG
(using the baseline services and the [classified] experiment
components as described in the TDRs)
• GridPP to address UB questionnaire outcomes
(perceptions as well as actual shortcomings)
• GridPP to document the high-level "value" GridPP is
adding/delivering
(using Project Map)
• OC8 in February 2006 “important” (not “G8 on Wednesday”)
The “Get Fit” Plan

Production Grid Metrics (0.100 to 0.147)
• Set SMART (Specific Measurable Achievable Realistic Time-phased) Goals
[Cartoon: “I take it plea bargaining is out of the question?”]
• See Dave’s talk
Production Metrics
Owner: Jeremy Coles
Status Date: 31-Mar-05

Number  Status  Due Date  Title
0.100   OK      On going  Fraction of UK sites in Production
0.101   OK      On going  Number of registered users
0.102   OK      On going  Number of active users
0.103   OK      On going  Number of supported VOs
0.104   NOT OK  On going  Number of LCG/EGEE Job Slots Published by UK
0.105   NOT OK  On going  Fraction of LCG/EGEE Job Slots Used
0.106   NOT OK  On going  GridPP KSI2K Available
0.107   OK      On going  GridPP KSI2K Available to EGEE/LCG
0.108   NOT OK  On going  GridPP disk storage available
0.109   OK      On going  GridPP disk storage available to LCG/EGEE
0.110   OK      On going  GridPP Tape storage available
0.111   NOT OK  On going  GridPP Tape storage available to LCG/EGEE
0.112   NOT OK  On going  Fraction of available KSI2K used in quarter
0.113   NOT OK  On going  Fraction of available Disk used in quarter
0.114   NOT OK  On going  Fraction of available Tape used in quarter
0.115   OK      On going  Number of sites publishing LCG accounting data
0.116   OK      On going  Percentage of total jobs run via the Grid
0.117   OK                Job failure rates
0.118   OK                UK contribution to LHC experiments
0.119   OK                UK contribution to non-LHC experiments
0.120   OK      On going  T1 participation in GOC service challenges
0.121   OK      On going  T2s participation in GOC service challenges
0.122   OK      On going  GridPP participating in EGEE security challenges
0.123   OK      On going  T1 participating in 3D database phases
0.124   OK      On going  GridPP security audit
0.125   OK      On going  UB schedule implemented and upheld
0.126   OK                T1 support for GOC
0.127   OK                T1 meeting pre-production service commitments
0.128   OK      On going  T1 meeting JRA1 commitments
0.129   OK                T1 meeting "other" user commitments
0.130   OK      On going  GridPP LCG middleware testbed operational
0.131   NOT OK  On going  Tier-1 service disaster recovery plans up to date
0.132   OK      On going  Production service risks and issues log available and up to date
0.133   OK      On going  Deployment team meetings
0.134   OK      On going  UK wide deployment support active
0.135   OK      On going  Quarterly operational performance review
0.136   OK      On going  Tier-1 delivering to LCG MoU
0.137   OK      On going  Tier-2s delivering to LCG MoU
0.138   OK      On going  Site operating system upgrades
0.139   OK      On going  GridPP deployment web-pages up-to-date
0.140   OK      On going  Training needs addressed
0.141   OK                GridPP helpdesk functioning adequately
0.142   OK      On going  Fraction of Site Functional Tests passed over the last quarter
0.143   NOT OK  On going  Accumulated scheduled downtime in last quarter
0.144   OK                Average number of sites per quarter available in VO selections (N/a)
0.145   OK      On going  Number of GridPP (site) system security incidents in the last quarter
0.146   OK      On going  Number of EGEE Grid security incidents in the last quarter
0.147   OK      On going  Sites comply with LCG/EGEE security policy
Our 14 problems…
– 0.104: Number of LCG/EGEE job slots published by the UK. The current total is 2477 and the target was 3000.
– 0.105: Number of LCG/EGEE job slots used. The current fraction is 19% compared to a target of 70%. This demonstrates that 0.104 above is clearly not an issue but that usage is presently low.
– 0.106: GridPP KSI2K available: By the end of March 2005 the combined Tier-1 and Tier-2 CPU power was expected to be 5184 KSI2K compared to 2277 KSI2K achieved. This number is dominated by the 4397 KSI2K expected from the Tier-2s which has been slowly becoming available.
– 0.108: GridPP disk storage available: Similar to 0.106 above. Only 280TB available compared to 968TB anticipated but the situation is improving.
– 0.111: GridPP tape storage made available to LCG/EGEE. At present the tape storage is being used but not really via the Grid route.
– 0.112: Fraction of available KSI2K used in quarter: at present a rough estimate shows about 42% of the available CPU was used compared to a target value of 70%.
– 0.113: Fraction of available disk used in quarter: This is estimated at 64% compared to the target of 70%.
– 0.114: Fraction of available Tape used in quarter: This is estimated at 61% compared to the target of 70%.
– 0.131: Tier-1 service disaster recovery plans up to date: This has not been updated within the last 6 months.
– 0.143: Accumulated scheduled downtime in the last quarter: The current value of 418 days is almost identical to the (current) target of 411 days. The metric expects the 25% figure to reduce to 5% by the third year.
– 3.6.3: LCG Deployment evaluation reports: first report due in March 05 was delayed to the second quarter.
– 5.2.4: Tier-2 Hardware realisation: This flags the same issue as 0.106 and 0.108 above. Tier-2 hardware has been delayed but the situation is improving.
– 5.2.7: Quarterly reports received within 1 month of the end of the quarter: The 05Q1 reports were received late. Some of the delay was due to the unfortunate timing of EGEE meetings.
– 6.2.11: Non-HEP applications tested on the GridPP Grid (submitted via the NGS submission mechanism). The NGS submission mechanism is not yet adequate.
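Several of the problems above are stated as an achieved value against a target, so they can be ranked by how far each falls short. The sketch below does this for the metrics that quote both numbers; the dictionary layout and names are hypothetical, and ranking by percentage of target is an illustrative choice rather than the Project Map's own method.

```python
# (achieved, target) pairs quoted in the problem list; units differ per metric.
SHORTFALLS = {
    "0.104 job slots published": (2477, 3000),
    "0.105 fraction of job slots used (%)": (19, 70),
    "0.106 KSI2K available": (2277, 5184),
    "0.108 disk available (TB)": (280, 968),
    "0.112 fraction of KSI2K used (%)": (42, 70),
    "0.113 fraction of disk used (%)": (64, 70),
    "0.114 fraction of tape used (%)": (61, 70),
}


def percent_of_target(achieved: float, target: float) -> float:
    """Achieved value as a percentage of its target."""
    return 100.0 * achieved / target


# Sort so that the metric furthest below its target comes first.
ranked = sorted(
    (percent_of_target(achieved, target), name)
    for name, (achieved, target) in SHORTFALLS.items()
)

for pct, name in ranked:
    print(f"{name}: {pct:.0f}% of target")
```

Ranked this way, job-slot usage (0.105) and disk availability (0.108) are furthest from target at under 30%, while disk and tape usage (0.113, 0.114) are within about 15% of theirs.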
The “Get Fit” Plan
• … not (yet) “The Final Solution”
• We hope this drives the right behaviour
• Plea bargaining is (probably) OK..
Some Problem
Solving Strategies
Beyond GridPP2..
LHC EXPLOITATION PLANNING REVIEW
Input is requested from the UK project spokespersons, for
ATLAS and CMS for each of the financial years 2008/9 to
2011/12, and for LHCb, ALICE and GridPP for 2007/8 to 2011/12.
Physics programme
Please give a brief outline of the planned physics programme.
Please also indicate how this planned programme could be
enhanced with additional resources. In total this should be no
more than 3 sides of A4. The aim is to understand the
incremental physics return from increasing resources.
Input will be based upon PPAP roadmap input
E-Science and LCG-2 (26 Oct 2004)
and feedback from CB (12 Jan & 7 July 2005)
Problem Solving and
Improved Communication
• “Communication, in essence, is the shift of a particle from one part
of space to another part of space. A particle is the thing being
communicated. It can be an object, a written message, a spoken
word or an idea. In its crudest definition, this is communication.
• This simple view of communication leads to the full definition:
• Communication is the consideration and action of impelling an
impulse or particle from source-point across a distance to receipt-point, with the intention of bringing into being at the receipt-point
a duplication and understanding of that which emanated from the
source-point..”
• from The Scientology Handbook
This may be a clue to
how we will overcome
our problems
But we can always
improve this..
Summary
• LHC Technical Design Reports define an endpoint
• Responsive-mode deployment/development
• Timescales for LHC are soon – first cosmics data taken
• Oversight Committee – improve “efficiency”
• Some particular issues:
– Tier-1/A utilisation
– Documentation Officer
• “Get Fit” plan endorsed by OC
– requires support from everyone to improve metrics
– There are 14 deployment problems (some interdependency) that need to
be solved
– Many areas are now quantifiable (significant progress here)
– Service Challenges will help focus attention
– Improved communication and documentation (become a scientologist?!)
• Aim: measured end-to-end performance improvements during 2005
• Beyond GridPP2: input required over the summer to PPARC LHC
exploitation planning review