Bulk Data Copy Activity Description (JSDL and/or DMI?)


Bulk Data Copy Description Generalizations
(some DMI/JSDL overlap)
Bulk Copying: Recursive file/dir copying
between multiple sources and sinks
(potentially a draft straw-man proposal for a
‘Bulk Copy Document’?)
[email protected]
Overview
• Some overlap in data copy activity descriptions (JSDL and DMI)
• JSDL data staging and bulk copies
• DMI and bulk copies
• Some new draft proposals for DMI to address bulk data copying
• Reuse of the proposed DMI-common element set
• Some other points to consider
Some Overlap in Data Copy Activity Descriptions (JSDL and DMI)
• There is some overlap between JSDL Data Staging and DMI.
• The Source/Target <jsdl:DataStaging/> element is roughly similar to the Source/Sink <dmi:DEPR/> element.
• Both capture the source/target URI and credentials.
• At present, neither JSDL data staging nor DMI fully captures our requirements (this is not a criticism: each is intended to address its existing use cases, which only partially overlap with the requirements for a bulk data copy activity!).
Other
• Condor Stork, based on Condor ClassAds (see supplementary slides).
• Not sure whether Globus has, or intends, a similar definition in its new developments (e.g. SaaS). Anyone?
Using JSDL Data Staging elements to simulate a bulk data copy activity
Bulk Copy: Recursive file/dir copying between multiple sources and sinks
JSDL DATA STAGING AND BULK COPIES
JSDL Data Staging and the HPC File Staging Profile for Bulk Data Copying
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
  <jsdl:Source>
    <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Source>
  <jsdl:Target>
    <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Target>
  <Credentials> … </Credentials>
</jsdl:DataStaging>
JSDL Staging 1
Define both the source and target within the same <DataStaging/> element, which is permitted in JSDL. However, the HPC File Staging Profile (Wasson et al. 2008) limits a data staging element to a single credential definition.
Possibility: profile the use of <Credentials/> within the Source/Target elements?
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:FilesystemName>MY_SCRATCH_DIR</jsdl:FilesystemName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
  <jsdl:Source>
    <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Source>
  <Credentials> e.g. MyProxyToken </Credentials>
</jsdl:DataStaging>
JSDL Staging 2
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:FilesystemName>MY_SCRATCH_DIR</jsdl:FilesystemName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:Target>
    <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Target>
  <Credentials> e.g. wsa:Username/password token </Credentials>
</jsdl:DataStaging>
• A source element for fileA and a corresponding target element for staging-out the same file.
• Link <DataStaging/> elements via a common <FileName/> and <FilesystemName/>.
• By specifying that the input file is deleted after the job has executed, staging can be used to perform a data copy from one location to another via the staging host (intermediary).
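The linked stage-in/stage-out pattern above can be generated mechanically. A minimal Python sketch follows; the element names mirror the JSDL listings, but the `staging_pair` helper itself is hypothetical, not part of any spec:

```python
# Sketch: build the linked stage-in/stage-out <DataStaging/> pair used to
# simulate a copy via the staging host. Element names follow the JSDL
# listings above; the helper is illustrative only.
import xml.etree.ElementTree as ET

JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
ET.register_namespace("jsdl", JSDL_NS)

def q(tag):
    """Qualify a tag name with the JSDL namespace."""
    return f"{{{JSDL_NS}}}{tag}"

def staging_pair(filename, source_uri, target_uri, fs_name="MY_SCRATCH_DIR"):
    """Return (stage-in, stage-out) elements linked by FileName/FilesystemName."""
    def base(delete_on_term):
        ds = ET.Element(q("DataStaging"))
        ET.SubElement(ds, q("FileName")).text = filename
        ET.SubElement(ds, q("FilesystemName")).text = fs_name
        ET.SubElement(ds, q("CreationFlag")).text = "overwrite"
        if delete_on_term:
            ET.SubElement(ds, q("DeleteOnTermination")).text = "true"
        return ds

    stage_in = base(delete_on_term=True)   # input file removed after the job
    src = ET.SubElement(stage_in, q("Source"))
    ET.SubElement(src, q("URI")).text = source_uri

    stage_out = base(delete_on_term=False)
    tgt = ET.SubElement(stage_out, q("Target"))
    ET.SubElement(tgt, q("URI")).text = target_uri
    return stage_in, stage_out

sin, sout = staging_pair("fileA",
                         "gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA",
                         "ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA")
print(ET.tostring(sin, encoding="unicode"))
```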
Using Staging to Enact Bulk Copies
• In the context of bulk copying, the file staging host (intermediary) is
redundant:
– No need to explicitly name and aggregate (stage) files on a staging host (when
copying between a source and sink, the staging host is a hidden implementation
detail).
• No equivalent <dmi:DataLocations/> for defining alternative locations for a
source and sink (a nice feature of DMI).
• JSDL is designed to describe a single activity which is atomic from the
perspective of an external user (staging is part of this atomic activity). In bulk
copying, we need to identify and report on the status of each copy operation.
• Some additional elements are required (e.g. <dmi:TransferRequirements/>, <other:FileSelector/>, an abstract <URIConnectionProperties/> for connecting to different URI schemes; e.g. iRODS/SRB require 'McatZone' and 'defaultResource' properties). Are these new elements out of scope (i.e. should they remain proprietary)?
An overview of OGSA DMI and some current limitations for Bulk Copying
DMI AND BULK COPIES
OGSA DMI Overview
• The OGSA Data Movement Interface (DMI) (Antonioletti et al. 2008) defines a
number of elements for describing and interacting with a data transfer activity.
• The data source and destination are each described separately with a Data End Point Reference (DEPR), which is a specialized form of WS-Addressing element (Box et al. 2004).
• In contrast to the JSDL data staging model, a DEPR facilitates the definition of
one or more <Data/> elements within a <DataLocations/> element. This is
used to define alternative locations for the data source and/or sink.
• An implementation can select between its supported protocols and
select/retry different source/sink combinations (improves resilience and the
likelihood of performing a successful copy).
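The select/retry behaviour just described can be sketched as follows; `copy_with_alternatives` and the `transfer` callback are hypothetical stand-ins for an implementation's protocol-specific copy logic, not anything DMI defines:

```python
# Sketch: try each source/sink location pairing (as advertised by the two
# DEPRs) until one copy succeeds. Hypothetical helper, for illustration only.
from itertools import product

def copy_with_alternatives(sources, sinks, transfer, max_attempts=4):
    """sources/sinks: lists of (protocol, url) alternatives.
    transfer(src_url, dst_url) -> bool is a stand-in copy routine."""
    for attempt, (src, dst) in enumerate(product(sources, sinks)):
        if attempt >= max_attempts:
            break
        if src[0] != dst[0]:        # need a protocol both ends speak
            continue
        if transfer(src[1], dst[1]):
            return src, dst         # first successful combination
    return None

sources = [("gridftp", "gsiftp://a/src"), ("srm", "srm://a/src")]
sinks = [("gridftp", "gsiftp://b/dst")]

attempts = []
def fake_transfer(src, dst):
    attempts.append((src, dst))
    return True                     # pretend the first attempt succeeds

result = copy_with_alternatives(sources, sinks, fake_transfer)
```

A real implementation would also consult its supported-protocol list and the <dmi:MaxAttempts/> transfer requirement rather than a fixed `max_attempts`.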
DMI DEPR and Transfer Requirements
Source or Sink (wsa:EndpointReference type); Transfer Requirements (needs some extending):
<dmi:SourceOrSinkDataEPR>
  <wsa:Address>http://www.ogf.org/ogsa/2007/08/addressing/none</wsa:Address>
  <wsa:Metadata>
    <dmi:DataLocations>
      <dmi:Data ProtocolUri="http://www.ogf.org/ogsadmi/2006/03/im/protocol/gridftp-v20"
                DataUrl="gsiftp://example.org/name/of/the/dir/">
        <dmi:Credentials><other:MyProxyToken/></dmi:Credentials>
        <other:stuff/>
      </dmi:Data>
      <dmi:Data ProtocolUri="urn:my-project:srm"
                DataUrl="srm://example.org/name/of/the/dir/">
        <dmi:Credentials><wsse:UsernameToken/></dmi:Credentials>
        <other:stuff/>
      </dmi:Data>
    </dmi:DataLocations>
  </wsa:Metadata>
</dmi:SourceOrSinkDataEPR>
The DEPR defines alternative locations for the data source/sink, and each <Data/> nests its own credentials.
<dmi:TransferRequirements>
  <dmi:StartNotBefore/> ?
  <dmi:EndNoLaterThan/> ?
  <dmi:StayAliveTime/> ?
  <dmi:MaxAttempts/> ?
</dmi:TransferRequirements>
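A DEPR of this shape can also be assembled programmatically. A minimal Python/ElementTree sketch follows; the namespace URIs and the `depr` helper are assumptions for illustration, not normative values from the DMI spec:

```python
# Sketch: assemble a <dmi:SourceOrSinkDataEPR/> with alternative <dmi:Data/>
# locations, mirroring the listing above. Namespace URIs are placeholders.
import xml.etree.ElementTree as ET

DMI = "http://example.org/dmi"                 # assumed, not the real DMI ns
WSA = "http://www.w3.org/2005/08/addressing"

def depr(locations):
    """locations: list of (protocol_uri, data_url) alternatives."""
    epr = ET.Element(f"{{{DMI}}}SourceOrSinkDataEPR")
    addr = ET.SubElement(epr, f"{{{WSA}}}Address")
    addr.text = "http://www.ogf.org/ogsa/2007/08/addressing/none"
    meta = ET.SubElement(epr, f"{{{WSA}}}Metadata")
    locs = ET.SubElement(meta, f"{{{DMI}}}DataLocations")
    for proto, url in locations:
        data = ET.SubElement(locs, f"{{{DMI}}}Data",
                             {"ProtocolUri": proto, "DataUrl": url})
        ET.SubElement(data, f"{{{DMI}}}Credentials")  # per-location credentials
    return epr

source = depr([
    ("http://www.ogf.org/ogsadmi/2006/03/im/protocol/gridftp-v20",
     "gsiftp://example.org/name/of/the/dir/"),
    ("urn:my-project:srm", "srm://example.org/name/of/the/dir/"),
])
```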
DMI Data Transfer Factory Interface (representation)
[supported protocols] +
[service instance] GetDataTransferInstance([SourceDEPR],[SinkDEPR],[TransferRequirements]);
[factory attributes] GetFactoryAttributesDocument();
Current DMI Limitations for Bulk Copying (for multiple sources and sinks)
• DMI is intended to describe only a single data copy operation between one source and one sink (this is not a criticism; it is by design, for managing low-level transfers of single data units). To perform several transfers, a client would need to invoke a DMI service factory multiple times to create multiple DMI service instances.
• We require a single message packet that wraps multiple transfers into a single 'atomic' activity, rather than having to repeatedly invoke the DMI service factory (broadly similar to defining multiple JSDL data staging elements).
• Some of the existing functional spec elements require extension /
slight modification (in particular addition of <xsd:any/> and
<xsd:anyAttribute/> extension points to embed proprietary info in
suitable locations).
Note: the draft proposals presented here for bulk data copying are only intended for review/discussion/sanity-check/agreement (or not).
SOME NEW DRAFT PROPOSALS FOR DMI TO ADDRESS BULK DATA COPYING
Draft Proposal 1 – New <BulkDataCopy/> and <DataCopy/> Elements
• Add new elements to describe a bulk copy activity – effectively wrap multiple source-sink pairs within a single (standalone) document, e.g. <BulkDataCopy/> with nested <DataCopy/> elements.
<!-- Draft: TO REVISE/DISCUSS/SANITY-CHECK -->
<BulkDataCopy id="xsd:ID"?>
  <DataCopy id="xsd:ID"?> + <!-- one-to-many -->
    <SourceDEPR/>
    <SinkDEPR/>
    <DataCopyTransferRequirements/> ? <!-- needed? -->
    <xsd:any##other/> *
  </DataCopy>
  <TransferRequirements/> ?
  <xsd:any##other/> *
</BulkDataCopy>
Big Disclaimer: needs discussion, revision, sanity check, agreement (or not) etc.
• The outer <TransferRequirements/> applies to the whole bulk copy (wrapping elements that span all the sub-copies, e.g. <dmi:MaxAttempts/>, <dmi:StartNotBefore/> and other batch-window properties).
• Define an optional <DataCopyTransferRequirements/> for each <DataCopy/> in order to specify an additional, overriding requirement sub-set (e.g. for defining <FileSelector/> elements etc.).
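The override semantics can be sketched as a simple overlay of per-copy requirements on the bulk-level defaults; the requirement keys below are illustrative, not drawn from a finalized schema:

```python
# Sketch of the override semantics: per-<DataCopy/> requirements win over
# the bulk-level <TransferRequirements/> defaults. Keys are illustrative.
def effective_requirements(bulk_reqs, copy_reqs=None):
    """Overlay per-copy transfer requirements on the bulk-level set."""
    merged = dict(bulk_reqs)        # start from the bulk-wide defaults
    merged.update(copy_reqs or {})  # per-copy values override
    return merged

bulk = {"MaxAttempts": 3, "StartNotBefore": "2009-01-01T00:00:00Z"}
sub = {"MaxAttempts": 1, "FileSelector": "*.dat"}
print(effective_requirements(bulk, sub))
```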
Draft Proposal 2 – Introduce a New DMI Port Type
• Add a new DMI port type to accept the <BulkDataCopy/> doc (the current port type defines separate [SourceDEPR], [SinkDEPR], [TransferRequirements] arguments).
• Choice of two port types.
• Some minor changes to the existing functional spec (mostly adding xsd:any extension points and other small items).
Possible DMI Data Transfer Factory Interface Extension (draft representation)
[supported protocols] +
[service instance] GetDataTransferInstance([BulkDataCopy]);
[factory attributes] GetFactoryAttributesDocument();
• As per the existing Functional Spec, completely separate the activity description (BulkDataCopy) from the service interface rendering in order to define a generic and reusable element set.
Big Disclaimer: needs discussion, revision, sanity check, agreement (or not) etc….
Draft Proposal 3 – Extend <State/> and <InstanceAttributes/> and describe usage for bulk copying
• Since a bulk copy consists of multiple transfers, we need an (optional) way to report the status of each sub-copy.
• The (sub) state of each <DataCopy/> could be optionally nested within the <dmi:Detail/> element as part of the parent <dmi:State/> (i.e. in place of the existing <xsd:any/> extension point). In order to specify each sub-copy identifier, <dmi:State/> could be extended by adding an <xsd:anyAttribute/>:
<!-- Draft: TO REVISE/DISCUSS/SANITY-CHECK -->
<dmi:State value="Transferring">
  <dmi:Detail>
    <dmi:State dataCopyId="subcopy1" value="Done"/>
    <dmi:State dataCopyId="subcopy3" value="Failed:Unclean"/>
    <dmi:State dataCopyId="subcopy2" value="Transferring"/>
    . . .
  </dmi:Detail>
</dmi:State>
• Similarly, child <dmi:InstanceAttributes/> could be optionally nested within a parent <dmi:InstanceAttributes/> to represent each sub-copy using a similar approach. But is this actually necessary? (Probably not, since <dmi:TotalDataSize/> could be calculated across all the sub-copies.)
Big Disclaimer: needs discussion, revision, sanity check, agreement (or not) etc….
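As a sanity check of the nesting above, the parent <dmi:State/> value can be derived from the sub-copy states. The rollup rules below are an assumption for illustration (DMI does not define them), collapsing all failure variants to Failed:Unclean:

```python
# Sketch: roll sub-copy states up into a parent state, one plausible reading
# of the nested <dmi:State/> listing. The rules here are assumptions.
def rollup(substates):
    """substates: dict of dataCopyId -> state string."""
    values = list(substates.values())
    if any(v == "Transferring" for v in values):
        return "Transferring"            # anything still moving dominates
    if any(v.startswith("Failed") for v in values):
        return "Failed:Unclean"          # simplification: any failure taints all
    if values and all(v == "Done" for v in values):
        return "Done"
    return "Scheduled"

states = {"subcopy1": "Done", "subcopy3": "Failed:Unclean",
          "subcopy2": "Transferring"}
print(rollup(states))  # → Transferring (matches the parent value in the listing)
```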
Draft Proposal 4 – Other proposed modifications (possibly more not listed here)
• Add <xsd:any/> and <xsd:anyAttribute/> extension points to the existing DMI elements, e.g. in the dmi:DataType and dmi:DataLocationsType complex types, an anyAttribute in dmi:State, etc.
<complexType name="DataType">
  <annotation> . . . </annotation>
  <sequence>
    <element name="Credentials" type="dmi:CredentialsType" minOccurs="0"/>
    <xsd:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
  <attribute name="ProtocolUri" type="anyURI" use="required"/>
  <attribute name="DataUrl" type="anyURI" use="required"/>
  <xsd:anyAttribute namespace="##other" processContents="lax"/>
</complexType>
<complexType name="DataLocationType">
  <annotation> . . . </annotation>
  <sequence>
    <element name="Data" type="dmi:DataType" maxOccurs="unbounded"/>
    <xsd:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
  <xsd:anyAttribute namespace="##other" processContents="lax"/>
</complexType>
• Move elements referred to in the text of the functional spec into the functional spec schema, such as <FactoryAttributes/> and the fault types (currently defined in the plain WS Rendering schema).
• Some additional elements are required (e.g. <dmi:TransferRequirements/>, <other:FileSelector/>, an abstract <URIConnectionProperties/> for connecting to different URI schemes; e.g. iRODS/SRB require 'McatZone' and 'defaultResource' properties). Are these new elements out of scope, or should they remain proprietary?
As per the existing DMI Functional Spec, the Bulk Copy activity description would be clearly separated from the service interface rendering. This promotes a generic and reusable element set which can be adopted for use within other specs/profiles, e.g. a new bulk copy application definition for the <jsdl:Application/> element.
REUSE OF PROPOSED DMI-COMMON ELEMENT SET
Draft usage in JSDL 1
<jsdl:JobDefinition>
  <jsdl:JobDescription>
    <jsdl:JobIdentification ... />
    <jsdl:Application>
      <!--
        Possibility? Embed the new 'BulkDataCopy' doc as a new Application
        element, akin to the POSIXApplication or HPCProfileApplication elements
      -->
      <other:BulkDataCopyApplication>
        <dmi:BulkDataCopy>
          . . .
        </dmi:BulkDataCopy>
      </other:BulkDataCopyApplication>
    </jsdl:Application>
    <jsdl:Resources/>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
JSDL is intended to be a generic compute activity description language (not solely HPC).
a) In this example, a bulk data copy activity doc is used to describe a JSDL application.
b) Nest the proposed <BulkDataCopy/> document within the <jsdl:Application/> element. The <jsdl:Application/> element is a generic wrapper intended for this very purpose, e.g. akin to nesting <POSIXApplication/> or <HPCProfileApplication/>.
Draft usage in JSDL 2
<jsdl:JobDefinition>
  <jsdl:JobDescription>
    <jsdl:JobIdentification ... />
    <jsdl:Application>
      <!--
        Possibility? Stage in the BulkDataCopy doc and explicitly name the
        copy agent that would enact the copy activity
      -->
      <jsdl-posix:POSIXApplication>
        <jsdl-posix:Executable>/usr/bin/datacopyagent.sh</jsdl-posix:Executable>
        <jsdl-posix:Argument>my_BulkDataCopyDoc.xml</jsdl-posix:Argument>
      </jsdl-posix:POSIXApplication>
    </jsdl:Application>
    <jsdl:Resources/>
    <jsdl:DataStaging>
      <jsdl:FileName>my_BulkDataCopyDoc.xml</jsdl:FileName> . . .
    </jsdl:DataStaging>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
This is a less 'contract-driven' approach, but represents a perfectly valid re-use of the proposed <BulkDataCopy/> document: stage-in the <BulkDataCopy/> document as input for the executable.
Draft DMI sub-state specialisations in BES
• Profile the OGSA BES state model to account for DMI sub-state specializations and DMI lifecycle events.
• Adds optional DMI sub-state specializations. A client/service may recognize only the main BES states if necessary.
• Adds optional DMI lifecycle events (dmi:suspend, dmi:resume).
• Add DMI fault types?
[State diagram: BES states (Pending, Running, Cancelled, Failed, Finished) with DMI sub-states (Running:Transferring, Running:Suspended; Failed:Clean, Failed:Unclean, Failed:Unknown). BES and DMI lifecycle events shown in italics, i.e. the bes:TerminateActivities(), dmi:Suspend() and dmi:Resume() requests/operations.]
Some other stuff to consider
• JSDL-BES may be a better route for more widespread adoption of a bulk copy document? (e.g. consider existing BES implementations)
• Is orchestration of the proposed <DataCopy/> activities required? (e.g. sequential ordering, or even a DAG?). No compelling use-cases so far.
• For the proposed bulk copy doc: what about using element references, rather than defining solely 'in-line' XML docs, to cut down on element repetition (e.g. akin to the <jsdl:FileSystem/> element, which can be referenced through <jsdl:FilesystemName/> elements)? Abstract elements and substitution groups may also be useful here.
<BulkDataCopy id="MyBulkTransferA">
  <CopyResources>
    <Credential id="cred1" .../>
    <Credential id="cred2" .../>
    <TransferRequirements id="tr1" .../>
    <TransferRequirements id="tr2" .../>
    <DataEPR id="data1" .../>
    <DataEPR id="data2" .../>
    <DataEPR id="data3" .../>
  </CopyResources>
  <DataCopy id="subTransferA">
    <SourceDEPR idref="data1"/>
    <SinkDEPR idref="data3"/>
    <TransferRequirementsRef idref="tr1"/>
  </DataCopy>
  <DataCopy id="subTransferB">
    <SourceDEPR idref="data2"/>
    <SinkDEPR idref="data3"/>
    <TransferRequirementsRef idref="tr2"/>
  </DataCopy>
</BulkDataCopy>
Element 'id' and subsequent 'idref's: reduces XML repetition, but validation does not check for the correct types of referenced elements.
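Since schema validation alone cannot check that each idref resolves to the right kind of element, a small cross-reference pass could. A sketch follows; the tag and attribute names mirror the draft listing above, but the checker itself is hypothetical:

```python
# Sketch: verify that every idref in a draft <BulkDataCopy/> doc resolves,
# and resolves to an element of the expected tag. Hypothetical checker.
import xml.etree.ElementTree as ET

EXPECTED = {  # referring element -> tag its idref must resolve to
    "SourceDEPR": "DataEPR",
    "SinkDEPR": "DataEPR",
    "TransferRequirementsRef": "TransferRequirements",
}

def check_refs(doc):
    ids = {el.get("id"): el.tag for el in doc.iter() if el.get("id")}
    errors = []
    for el in doc.iter():
        ref = el.get("idref")
        if ref is None:
            continue
        want, got = EXPECTED.get(el.tag), ids.get(ref)
        if got is None:
            errors.append(f"{el.tag}: dangling idref '{ref}'")
        elif want and got != want:
            errors.append(f"{el.tag}: idref '{ref}' is a <{got}/>, expected <{want}/>")
    return errors

doc = ET.fromstring("""
<BulkDataCopy id="MyBulkTransferA">
  <CopyResources>
    <TransferRequirements id="tr1"/>
    <DataEPR id="data1"/>
    <DataEPR id="data3"/>
  </CopyResources>
  <DataCopy id="subTransferA">
    <SourceDEPR idref="data1"/>
    <SinkDEPR idref="data3"/>
    <TransferRequirementsRef idref="tr1"/>
  </DataCopy>
</BulkDataCopy>
""")
assert check_refs(doc) == []   # all references resolve to the right types
```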
Supplementary slides
OTHER STUFF / EXTRA SLIDES….
Message Model Requirements
Document Message
• Bulk Data Copy Activity description.
• Capture all the information required to connect to each source URI and sink URI and subsequently enact the data copy activity.
• Transfer requirements, e.g. additional URI properties, file selectors (regular expressions), scheduling parameters to define a batch-window, retry count, source/sink alternatives, checksums? sequential ordering? DAG?
• Serialized user credential definitions for each source and sink.
Control Messages
• Interact with a state/lifecycle model (e.g. stop, resume, cancel)
Event Messages
• Standard fault types and status updates
Information Model
• To advertise the service capabilities / properties / supported protocols
In-Scope
1. Job Submission Description Language (JSDL)
• An activity description language for generic compute applications.
2. OGSA Data Movement Interface (DMI)
• A low-level schema for defining the transfer of bytes between a single source and sink.
3. JSDL HPC File Staging Profile (HPCFS)
• Designed to address file staging, not bulk copying.
4. OGSA Basic Execution Service (BES)
• Defines a basic framework for defining and interacting with generic compute activities: JSDL + extensible state and information models.
5. Others that I am sure I have missed! (…ByteIO)
• None fully captures our requirements (not a criticism: they are designed to address their own use cases, which only partially overlap with the requirements for our bulk data copy activity).
Other
• Condor Stork, based on Condor ClassAds.
• Not sure whether Globus has, or intends, a similar definition in its new developments (e.g. SaaS). Anyone? – I believe Ravi was originally supportive of a DMI for data transfers between multiple sources/sinks.
Stork – Condor ClassAds
Example of a Stork job request:
[
  dest_url = "gsiftp://eric1.loni.org/scratch/user/";
  arguments = "-p 4 dbg -vb";
  src_url = "file:///home/user/test/";
  dap_type = "transfer";
  verify_checksum = true;
  verify_filesize = true;
  set_permission = "755";
  recursive_copy = true;
  network_check = true;
  checkpoint_transfer = true;
  output = "user.out";
  err = "user.err";
  log = "userjob.log";
]
• Purportedly the first batch scheduler for data placement and data movement in a heterogeneous environment. Developed with respect to Condor.
• Uses Condor's ClassAd job description language and is designed to understand the semantics and characteristics of data placement tasks.
• Recent NSF funding to develop it as a production service.