CASTOR SRM v1.1 experience Presentation at HEPiX MSS Forum 28/05/2004 Olof Bärring, CERN-IT.


CASTOR SRM v1.1 experience
Presentation at HEPiX MSS Forum
28/05/2004
Olof Bärring, CERN-IT
Outline
• Brief overview of SRM v1.1
• CASTOR implementation
• Interoperability tests
• Problems found
– SRM specification
– GSI
• SRM @ GGF: GSM WG
– Input to the definition of SRM-Basic
• Conclusions and outlook
Brief overview of SRM v1.1
• SRM = Storage Resource Manager
• First (v1.0) interface definition
– http://sdm.lbl.gov/srm-wg/doc/srm.v1.0.pdf
– October 22, 2001
– JLAB, FNAL and LBNL
– Some key features:
» Transfer protocol negotiation
» Multi-file requests
» Asynchronous operations
• SRM is a management interface
– Make files “available” for access (e.g. recall to disk)
– Prepare resources for receiving files (e.g. allocate disk space)
– Query status of requests or files managed by the SRM
– Not a WAN file transfer protocol
• URLs
– SURL – Site specific URL. Protocol neutral
» srm://castorgrid.cern.ch/castor/home/me/test
– TURL – Transfer URL. Protocol specific
» gsiftp://gridftp03.cern.ch/tmp/home/me/test
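The SURL/TURL distinction can be illustrated with a minimal sketch that only splits the two example URLs from this slide into their parts. The helper name `describe` is an assumption made for illustration; it is not part of any SRM client.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a SURL or TURL into scheme, host and path."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.hostname, "path": parts.path}


surl = describe("srm://castorgrid.cern.ch/castor/home/me/test")
turl = describe("gsiftp://gridftp03.cern.ch/tmp/home/me/test")

# The SURL names the file at the site; only the TURL names a transfer protocol
# and the host that actually serves the bytes.
print(surl)
print(turl)
```

Note that host and path differ between the two: the SRM is free to hand out a TURL on whichever disk server currently holds the file.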
SRM v1.0 operations
Operation          Description
get                Recall from tape and pin on disk
put                Reserve disk space, pin and maybe make permanent
getRequestStatus   Get the status of a running get/put
setFileStatus      Set the status of a file
pin                Pin file on disk
unPin              Cancel a previous pin operation
mkPermanent        Make existing file permanent
getProtocols       Get list of supported transfer/access protocols
getFileMetadata    Get file metadata
advisoryDelete     Recommend SRM to delete a file
getEstGetTime      Fake ‘get’ for time estimation
getEstPutTime      Fake ‘put’ for time estimation
Legend: asynchronous | synchronous/stateless
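How an asynchronous ‘get’ is typically driven by a client can be sketched with a toy in-memory server. Everything here is an assumption made for illustration: the class `ToySRM`, its method signatures, the file states "Pending" and "Ready", and the way the TURL is derived. The real interface is a SOAP service and is not modelled.

```python
import itertools


class ToySRM:
    """In-memory stand-in for an SRM server; not the real SOAP interface."""

    def __init__(self, polls_until_ready=2):
        self._ids = itertools.count(1)
        self._requests = {}
        self._polls_until_ready = polls_until_ready

    def get(self, surls, protocols):
        # Asynchronous: returns at once with a request id, files still pending.
        request_id = next(self._ids)
        self._requests[request_id] = {
            "polls": 0,
            "protocol": protocols[0],
            "files": {s: {"state": "Pending", "turl": None} for s in surls},
        }
        return request_id

    def get_request_status(self, request_id):
        # Each poll simulates some progress of the recall from tape.
        req = self._requests[request_id]
        req["polls"] += 1
        if req["polls"] >= self._polls_until_ready:
            for surl, f in req["files"].items():
                if f["state"] == "Pending":
                    f["state"] = "Ready"
                    # Toy TURL: same host and path, transfer protocol as scheme.
                    f["turl"] = surl.replace("srm://", req["protocol"] + "://", 1)
        return {s: dict(f) for s, f in req["files"].items()}

    def set_file_status(self, request_id, surl, state):
        self._requests[request_id]["files"][surl]["state"] = state


srm = ToySRM()
surl = "srm://castorgrid.cern.ch/castor/home/me/test"
rid = srm.get([surl], ["gsiftp"])
status = srm.get_request_status(rid)
while status[surl]["state"] != "Ready":
    status = srm.get_request_status(rid)
turl = status[surl]["turl"]             # the file transfer would happen here
srm.set_file_status(rid, surl, "Done")  # tell the SRM the file is no longer needed
```

A real client would sleep between polls; the loop is kept tight here only to keep the sketch short.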
The ‘copy’ operation
• SRM v1.1 == SRM v1.0 + ‘copy’
• ‘copy’ quite different from other SRM operations:
– Copy file(s) from/to local SRM to/from another
(optionally remote) SRM
– The target SRM performs the necessary ‘put’ and ‘get’
operations and executes the file transfers using the
negotiated protocol (e.g. gsiftp)
• The ‘copy’ operation allows a batch job running
on a worker node without in- or outbound WAN
access to copy files to a remote storage element
• The ‘copy’ operation was documented only 4 days
ago(!)
• The ‘copy’ operation could potentially provide the
framework for planning transfers of large data
volumes (e.g. LHC T0 → T1 data broadcasting)?
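The division of labour in ‘copy’ can be sketched as follows: the client only asks the target SRM to fetch the file, and the target negotiates a protocol with the source and moves the data itself. The class `ToyStore`, the function `negotiate`, the protocol lists and the destination path are all hypothetical; the data movement is reduced to a dictionary assignment.

```python
def negotiate(preferred, supported):
    """Return the first protocol in the preference list that the peer supports."""
    for proto in preferred:
        if proto in supported:
            return proto
    return None


class ToyStore:
    """Hypothetical in-memory SRM holding file contents keyed by path."""

    def __init__(self, protocols):
        self.protocols = protocols
        self.files = {}

    def copy_from(self, source, src_path, dst_path):
        # The target SRM drives the copy: 'get' on the source, 'put' locally.
        proto = negotiate(self.protocols, source.protocols)
        if proto is None:
            raise RuntimeError("no common transfer protocol")
        data = source.files[src_path]   # stands in for the 'get' and the transfer
        self.files[dst_path] = data     # stands in for the local 'put'
        return proto


local = ToyStore(["rfio", "gsiftp"])
remote = ToyStore(["gsiftp"])
local.files["/castor/home/me/test"] = b"payload"

# The worker node only issues this one request to the remote (target) SRM.
used = remote.copy_from(local, "/castor/home/me/test", "/store/me/test")
```

The batch job never opens a WAN data connection itself, which is the property the slide highlights.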
CASTOR SRM v1.1
• Implements the vital operations
– get, put, getRequestStatus, setFileStatus, getProtocols
• No-ops:
– pin, unPin, getEstGetTime, getEstPutTime
• Implemented but optionally disabled (requested
by LCG)
– advisoryDelete
• CASTOR GSI (CGSI) plug-in for gSOAP
– Also used in GFAL
• Evolution @ CERN:
– First prototype in summer 2003
– First production version deployed in December 2003
• Other sites that have deployed the CASTOR SRM
– CNAF (INFN/Bologna)
– PIC (Barcelona)
CASTOR SRM v1.1
[Figure: CASTOR SRM v1.1 architecture. Grid services: SRM (with SRM request repository) and gridftp, each behind GSI. CASTOR disk cache: stager, CASTOR name space, RFIO, local clients. CASTOR tape archive: tape queue, Volume Manager, tape mover.]
Interoperability tests
• CASTOR SRM has been running
interoperability tests with various clients,
notably
– GFAL (Jean-Philippe)
– EDG replica manager (Peter)
– FNAL/dCache SRM (Timur)
Problems found
• The interoperability problems can be
classified as:
– Due to problems with the SRM specification
– Due to assumptions in SRM or SOAP
implementations
– Due to GSI incompatibilities
• Debugging GSI incompatibilities is
by far the most difficult and
time-consuming
Problems with SRM spec (1)
• Lack of enumeration
– All enumeration-like types are strings
– Client needs to find a common denominator (e.g. convert all
strings to upper case)
• Request and file state lifecycles
– Concise for ‘put’ or ‘get’
– Undefined for ‘copy’ (a proposal was circulated 4 days
ago). This turned out to be an important interoperability
issue between CERN/CASTOR and FNAL/dCache SRMs
– Undefined for ‘mkPermanent’, ‘pin’, ‘unpin’ (probably
irrelevant for the latter two)?
• Request history
– What an SRM should do with requests that have reached
the “Done” or “Failed” status
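The "common denominator" workaround for string-typed enumerations can be sketched as a case-insensitive mapping onto a client-side enum. Only "Done" and "Failed" appear in these slides; the other two members and the names `RequestState` and `parse_state` are assumptions for illustration.

```python
from enum import Enum


class RequestState(Enum):
    PENDING = "Pending"   # assumed state name
    ACTIVE = "Active"     # assumed state name
    DONE = "Done"
    FAILED = "Failed"


def parse_state(raw):
    """Map a free-form state string onto RequestState, ignoring case and padding."""
    key = raw.strip().upper()
    try:
        return RequestState[key]
    except KeyError:
        raise ValueError(f"unknown request state: {raw!r}") from None


# Different servers may spell the same state differently.
print(parse_state("done"), parse_state("Done"), parse_state(" DONE "))
```

Rejecting unknown strings explicitly, rather than passing them through, makes a disagreement between two implementations visible at the first request instead of deep inside a transfer.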
Problems with SRM spec (2)
• Immutability of request identifier
– Request id is a 32 bit word
– Unspecified if an SRM can reuse request ids for finished
(“Done” or “Failed”) requests
• SURL (Site URL) semantics
– Is it a URL or a URI?
– If URL, does it support relative and absolute paths?
– If URI → name space is virtually flat for an arbitrary
client
• Pin lifetime
– Pin lifetime is defined to be subject to site policy
– No way to query the remaining pin lifetime for a
particular file
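One conservative reading of the request-id question is that a server never hands out an id while it still remembers a request with that id. The sketch below assumes a signed 32-bit id (the slide only says "32 bit") and uses a hypothetical class `RequestIdAllocator`; it is not taken from the CASTOR implementation.

```python
class RequestIdAllocator:
    """Hand out 32-bit request ids, never reusing one that is still remembered."""

    MAX_ID = 2**31 - 1  # assumption: signed 32-bit id

    def __init__(self, start=1):
        self._next = start
        self._known = set()  # ids of active and finished-but-kept requests

    def allocate(self):
        for _ in range(self.MAX_ID):
            candidate = self._next
            # Wrap around to 1 once the id space has been walked through.
            self._next = 1 if candidate == self.MAX_ID else candidate + 1
            if candidate not in self._known:
                self._known.add(candidate)
                return candidate
        raise RuntimeError("request id space exhausted")

    def forget(self, request_id):
        # Only after this call may the id be handed out again.
        self._known.discard(request_id)


alloc = RequestIdAllocator(start=RequestIdAllocator.MAX_ID)
first = alloc.allocate()   # last id before wrap-around
second = alloc.allocate()  # wraps to 1
```

When `forget` may be called for a "Done" or "Failed" request is exactly the request-history question left open by the specification.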
Problems with SRM spec (3)
• Exception handling and error propagation
– Unspecified if a multi-file request should fail when a subset of
the files got an error
– Unspecified if and when an SRM can do retries
– Only one error message, global for all files in a multi-file
request, is available for reporting
– Format and contents of error message undefined
• advisoryDelete != delete
– It may be vital to know what the effect is
• No effect at all (if so, what happens if SURL is reused for a new
file?)
• Only remove disk resident copy (if so, when?)
• Remove HSM file (if so, when?)
• Directory creation on the fly for ‘put’ requests
– If a ‘put’ request specifies a SURL corresponding to a path for
which one or several sub-directory levels do not exist, should
the SRM create the missing dirs on the fly (provided the client has
the appropriate permissions)?
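The multi-file ambiguity can be made concrete with a sketch in which the server policy is an explicit switch. The names `FileResult`, `summarize` and `fail_on_any_error`, and the example SURLs, are hypothetical; packing per-file errors into the single global message is one possible workaround, not something the specification prescribes.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    surl: str
    state: str        # "Done" or "Failed"
    error: str = ""   # per-file message; the request has only one global slot


def summarize(results, fail_on_any_error=True):
    """Derive a request-level state and a single error string from per-file results."""
    failed = [r for r in results if r.state == "Failed"]
    if not failed:
        return "Done", ""
    # With only one global message available, pack the per-file errors into it.
    message = "; ".join(f"{r.surl}: {r.error}" for r in failed)
    if fail_on_any_error or len(failed) == len(results):
        return "Failed", message
    return "Done", message


results = [
    FileResult("srm://example.org/data/f1", "Done"),
    FileResult("srm://example.org/data/f2", "Failed", "no such file"),
]
strict = summarize(results)
lenient = summarize(results, fail_on_any_error=False)
```

Two servers that pick different values for the switch return different request states for the same outcome, which is the interoperability problem described above.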
Problems due to SRM or SOAP
implementation details
• SRM WSDL discovery
– FNAL client assumed the WSDL and the service are
hosted by the same web server
• Bug in gSOAP v2.3 WSDL importer
• Various bugs in CASTOR SRM found but
not reported here
GSI problems (1)
• CASTOR (GSI) – EDG RC (Java TrustManager)
– TrustManager does not use the GSI default of SSL
handshake + credential delegation, but just an SSL
handshake
– TrustManager client would not work with SSL 3.0, which
is forced by GSI
– Solution: EDG RC uses CoG (Globus Java Security
Implementation) instead
• CASTOR (GSI) – FNAL dCache (Java CoG)
– FNAL client only offered a limited set of encryption
algorithms, which did not match those provided by
standard GSI
– Limited Proxy certificate
• GSI error reporting not working properly
GSI problems (2)
• Administration and deployment issues
– EDG Globus patch supporting dynamic pool
accounts requires the GRIDMAPDIR environment variable to be
declared, even if the default location is used for the
security files
– Configuration problems (the right root CA not trusted)
– CERN CA changed the Certificate naming scheme
(number added at the end of DN). New certificates were
not automatically propagated (to, for instance, FNAL).
• The effort for debugging GSI problems will scale
with the number of SRM implementations
– Establishing an ‘SRM reference implementation’ for
certifying new servers and clients would help
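One way a changed DN naming scheme can surface at a remote site is a failed lookup in a grid-mapfile that still lists the old DN. The sketch below is an assumption about how such a failure looks, not a record of what happened at FNAL; the DNs, the account name and the helper `parse_gridmap` are invented for illustration.

```python
import shlex


def parse_gridmap(text):
    """Parse grid-mapfile lines of the form: "<certificate DN>" <local account>."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        dn, account = shlex.split(line)  # shlex honours the quotes around the DN
        mapping[dn] = account
    return mapping


GRIDMAP = '''
# hypothetical entry, for illustration only
"/O=Grid/O=CERN/OU=cern.ch/CN=Some User" someuser
'''

mapping = parse_gridmap(GRIDMAP)
old_dn = "/O=Grid/O=CERN/OU=cern.ch/CN=Some User"
new_dn = "/O=Grid/O=CERN/OU=cern.ch/CN=Some User 1234"  # number appended to the DN

print(mapping.get(old_dn))  # mapped to a local account
print(mapping.get(new_dn))  # no mapping until the new DN is propagated
```

The lookup is an exact string match, so even a small change to the DN is enough to lock a user out until every site has updated its files.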
SRM @ GGF: GSM WG
• GGF GSM (Grid Storage Management) WG
– SRM interface specification for GGF will proceed in two steps
• SRM-Basic
• SRM-Advanced
– Current proposal is to have
• SRM-Basic relatively close to SRM v1.1
• SRM-Advanced close to SRM v2.1 + vaguely defined features like
authorization, access control, monitoring
• Suggestion to the HEPiX MSS forum on how we could use the GSM WG
– SRM-Basic is hopefully sufficient for LHC Tier-0 → Tier-1 data
distribution. With that objective it is essential that
• all existing interoperability problems with the SRM v1.1 definition are
addressed as appropriate
• new features are kept to the minimum necessary
– Hopefully we have already come up with some input during
these two days
Conclusions and outlook
• CASTOR SRM v1.1 has been in production for a couple
of months at CERN and some other CASTOR Tier-1 sites
• SRM interoperability does not come for free
– Definition not concise enough, room for too much site
specific interpretation
– Is GSI interoperability an illusion and, if so, will it
continue to be so?
• We currently have no plans for a CASTOR SRM
v2.1 implementation. We would rather tighten
up SRM v1.1 in the context of the GGF GSM WG
and the SRM-Basic definition