CASTOR SRM v1.1 experience Presentation at HEPiX MSS Forum 28/05/2004 Olof Bärring, CERN-IT.
Outline

• Brief overview of SRM v1.1
• CASTOR implementation
• Interoperability tests
• Problems found
  – SRM specification
  – GSI
• SRM @ GGF: GSM WG
  – Input to the definition of SRM-Basic
• Conclusions and outlook

Brief overview of SRM v1.1

• SRM = Storage Resource Manager
• First (v1.0) interface definition
  – http://sdm.lbl.gov/srm-wg/doc/srm.v1.0.pdf
  – October 22, 2001
  – JLAB, FNAL and LBNL
  – Some key features:
    • Transfer protocol negotiation
    • Multi-file requests
    • Asynchronous operations
    • SRM is a management interface
      – Make files “available” for access (e.g. recall to disk)
      – Prepare resources for receiving files (e.g. allocate disk space)
      – Query status of requests or files managed by the SRM
      – Not a WAN file transfer protocol
    • URLs
      – SURL: site-specific URL, protocol neutral
        » srm://castorgrid.cern.ch/castor/home/me/test
      – TURL: transfer URL, protocol specific
        » gsiftp://gridftp03.cern.ch/tmp/home/me/test

SRM v1.0 operations

  get               Recall from tape and pin on disk
  put               Reserve disk space, pin and maybe make permanent
  getRequestStatus  Get the status of a running get/put
  setFileStatus     Set the status of a file
  pin               Pin file on disk
  unPin             Cancel a previous pin operation
  mkPermanent       Make existing file permanent
  getProtocols      Get list of supported transfer/access protocols
  getFileMetadata   Get file metadata
  advisoryDelete    Recommend SRM to delete a file
  getEstGetTime     Fake ‘get’ for time estimation
  getEstPutTime     Fake ‘put’ for time estimation

  (Slide legend: Asynchronous; Synchronous/stateless)

The ‘copy’ operation

• SRM v1.1 == SRM v1.0 + ‘copy’
• ‘copy’ is quite different from the other SRM operations:
  – Copy file(s) from/to the local SRM to/from another (optionally remote) SRM
  – The target SRM performs the necessary ‘put’ and ‘get’ operations and executes the file transfers using the negotiated protocol (e.g. gsiftp)
• The ‘copy’ operation allows a batch job running on a worker node without inbound or outbound WAN access to copy files to a remote storage element
• The ‘copy’ operation was documented only 4 days ago(!)
• The ‘copy’ operation could potentially provide the framework for planning transfers of large data volumes (e.g. LHC T0 → T1 data broadcasting)??
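The SURL/TURL distinction and the transfer protocol negotiation listed among the key features can be illustrated with a small sketch. It is not the CASTOR implementation: the function names, the `SERVER_PROTOCOLS` table and the host chosen for `rfio` are hypothetical, and a real SRM decides the TURL host and path itself (in the slide example the path changes from `/castor/...` to `/tmp/...`), whereas this sketch simply keeps the SURL path.

```python
from urllib.parse import urlsplit

# Hypothetical server-side table: supported transfer protocol -> transfer host.
# Only the gsiftp host appears in the slides; the rfio entry is an assumption.
SERVER_PROTOCOLS = {
    "gsiftp": "gridftp03.cern.ch",
    "rfio": "castorgrid.cern.ch",
}


def negotiate_protocol(client_protocols, server_protocols=SERVER_PROTOCOLS):
    """Return the first protocol in the client's preference list that the
    server also supports, or None if there is no common protocol."""
    for proto in client_protocols:
        if proto in server_protocols:
            return proto
    return None


def surl_to_turl(surl, client_protocols, server_protocols=SERVER_PROTOCOLS):
    """Turn a protocol-neutral SURL into a protocol-specific TURL.

    Simplification: the SURL path is reused unchanged; a real SRM may map
    it to a different location (e.g. a disk cache path).
    """
    parts = urlsplit(surl)
    if parts.scheme != "srm":
        raise ValueError(f"not a SURL: {surl!r}")
    proto = negotiate_protocol(client_protocols, server_protocols)
    if proto is None:
        raise LookupError("no transfer protocol in common with the server")
    return f"{proto}://{server_protocols[proto]}{parts.path}"


if __name__ == "__main__":
    # The client prefers dcap (unsupported here), then gsiftp.
    print(surl_to_turl("srm://castorgrid.cern.ch/castor/home/me/test",
                       ["dcap", "gsiftp"]))
```

The client passes its protocols in order of preference and the server picks the first one it supports, which is one plausible reading of "negotiation"; the v1.0 document referenced above is the place to check the actual rule.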

CASTOR SRM v1.1

• Implements the vital operations
  – get, put, getRequestStatus, setFileStatus, getProtocols
• No-ops:
  – pin, unPin, getEstGetTime, getEstPutTime
• Implemented but optionally disabled (requested by LCG)
  – advisoryDelete
• CASTOR GSI (CGSI) plug-in for gSOAP
  – Also used in GFAL
• Evolution @ CERN:
  – First prototype in summer 2003
  – First production version deployed in December 2003
• Other sites that have deployed the CASTOR SRM
  – CNAF (INFN/Bologna)
  – PIC (Barcelona)

CASTOR SRM v1.1

[Architecture diagram. Labelled components: Grid services, SRM, request repository, GSI, gridftp, CASTOR disk cache, stager, CASTOR name space, RFIO, local clients, CASTOR tape archive, tape queue, Volume Manager, tape mover]

Interoperability tests

• CASTOR SRM has been running interoperability tests with various clients, notably
  – GFAL (Jean-Philippe)
  – EDG replica manager (Peter)
  – FNAL/dCache SRM (Timur)

Problems found

• The interoperability problems can be classified as:
  – Due to problems with the SRM specification
  – Due to assumptions in SRM or SOAP implementations
  – Due to GSI incompatibilities
• Debugging GSI incompatibilities is by far the most difficult and time-consuming

Problems with SRM spec (1)

• Lack of enumeration
  – All enumeration-like types are strings
  – Client needs to find a common denominator (e.g. convert all strings to upper case)
• Request and file state lifecycles
  – Concise for ‘put’ or ‘get’
  – Undefined for ‘copy’ (a proposal was circulated 4 days ago). This turned out to be an important interoperability issue between CERN/CASTOR and FNAL/dCache SRMs
  – Undefined for ‘mkPermanent’, ‘pin’, ‘unPin’ (probably irrelevant for the latter two)?
• Request history
  – Unspecified what an SRM should do with requests that have reached the “Done” or “Failed” status

Problems with SRM spec (2)

• Immutability of request identifier
  – Request id is a 32-bit word
  – Unspecified if an SRM can reuse request ids for finished (“Done” or “Failed”) requests
• SURL (Site URL) semantics
  – Is it a URL or a URI?
  – If URL, does it support relative and absolute paths?
  – If URI, the name space is virtually flat for an arbitrary client
• Pin lifetime
  – Pin lifetime is defined to be subject to site policy
  – No way to query the remaining pin lifetime for a particular file

Problems with SRM spec (3)

• Exception handling and error propagation
  – Unspecified if a multi-file request should fail when a subset of the files got an error
  – Unspecified if and when an SRM can do retries
  – Only one error message, global for all files in a multi-file request, is available for reporting
  – Format and contents of error message undefined
• advisoryDelete != delete
  – It may be vital to know what the effect is
    • No effect at all (if so, what happens if the SURL is reused for a new file?)
    • Only remove disk resident copy (if so, when?)
    • Remove HSM file (if so, when?)
• Directory creation on the fly for ‘put’ requests
  – If a ‘put’ request specifies a SURL corresponding to a path for which one or several sub-directory levels do not exist, should the SRM create the missing directories on the fly (provided the client has the appropriate permissions)?
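The "common denominator" workaround for the missing enumerations (convert all state strings to upper case before comparing) could look roughly like the sketch below. It is only an illustration: `RequestState`, `parse_state` and `is_finished` are hypothetical names, and of the state values only “Done” and “Failed” are taken from the slides; `PENDING` and `ACTIVE` are assumed here to make the example complete.

```python
from enum import Enum


class RequestState(Enum):
    """Client-side enumeration for state strings returned by an SRM."""
    PENDING = "PENDING"  # assumed value, not named in the slides
    ACTIVE = "ACTIVE"    # assumed value, not named in the slides
    DONE = "DONE"        # "Done" in the slides
    FAILED = "FAILED"    # "Failed" in the slides


def parse_state(raw):
    """Map a free-form state string to RequestState, ignoring case and
    surrounding whitespace. Unknown strings raise ValueError instead of
    being silently accepted."""
    key = raw.strip().upper()
    try:
        return RequestState(key)
    except ValueError:
        raise ValueError(f"unknown SRM request state: {raw!r}") from None


def is_finished(raw):
    """True once the request has reached a terminal state."""
    return parse_state(raw) in (RequestState.DONE, RequestState.FAILED)


if __name__ == "__main__":
    # Different servers may spell the same state differently.
    for text in ("Done", "DONE", "failed", "Pending"):
        print(text, "->", parse_state(text).name, is_finished(text))
```

Rejecting unknown strings makes a spelling or vocabulary mismatch between implementations visible immediately, rather than leaving a client polling a request it will never recognise as finished.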

Problems due to SRM or SOAP implementation details

• SRM WSDL discovery
  – FNAL client assumed that the WSDL and the service are hosted by the same web server
• Bug in gSOAP v2.3 WSDL importer
• Various bugs in CASTOR SRM found but not reported here

GSI problems (1)

• CASTOR (GSI) – EDG RC (Java TrustManager)
  – TrustManager does not use the GSI default of SSL handshake + credential delegation, but just an SSL handshake
  – TrustManager client would not work with SSL 3.0, which is forced by GSI
  – Solution: EDG RC uses CoG (Globus Java Security Implementation) instead
• CASTOR (GSI) – FNAL dCache (Java CoG)
  – FNAL client used only a limited number of encryption algorithms, which did not match those provided by standard GSI
  – Limited proxy certificate
• GSI error reporting not working properly

GSI problems (2)

• Administration and deployment issues
  – The EDG Globus patch supporting dynamic pool accounts requires the GRIDMAPDIR environment variable to be declared, even if the default location is used for the security files
  – Configuration problems (correct root CA not trusted)
  – CERN CA changed the certificate naming scheme (number added at the end of the DN). New certificates were not automatically propagated (to, for instance, FNAL).
• The effort for debugging GSI problems will scale with the number of SRM implementations
  – Establishing an ‘SRM reference implementation’ for certifying new servers and clients would help

SRM @ GGF: GSM WG

• GGF GSM (Grid Storage Management) WG
  – SRM interface specification for GGF will proceed in two steps
    • SRM-Basic
    • SRM-Advanced
  – Current proposal is to have
    • SRM-Basic relatively close to SRM v1.1
    • SRM-Advanced close to SRM v2.1 + vaguely defined features like authorization, access control, monitoring
• Suggestion to the HEPiX MSS forum on how we could use the GSM WG
  – SRM-Basic is hopefully sufficient for LHC Tier-0 → Tier-1 data distribution. With that objective it is essential that
    • all existing interoperability problems with the SRM v1.1 definition are addressed as appropriate
    • the addition of new features is kept to the minimum necessary
  – Hopefully we have already come up with some input during these two days

Conclusions and outlook

• CASTOR SRM v1.1 has been in production for a couple of months at CERN and some other CASTOR Tier-1 sites
• SRM interoperability does not come for free
  – Definition not concise enough, room for too much site-specific interpretation
  – Is GSI interoperability an illusion and, if so, will it continue to be so?
• We currently have no plans for a CASTOR SRM v2.1 implementation. We would rather tighten up SRM v1.1 in the context of the GGF GSM WG and the SRM-Basic definition