
Towards a loosely coupled and scalable component set
for scheduling bulk data copying across different
storage resources as fault tolerant batch jobs.
http://code.google.com/p/dtsproject/
David Meredith1, Stephen Crouch2, Peter Turner3, Gerson Galang4, Ming Jiang5, Hung Nguyen6
1NGS, Science and Technology Facilities Council, Daresbury Labs, UK, [email protected]
2OMII-UK, School of Electronics and Comp Sci, University of Southampton, UK, [email protected]
3University of Sydney, Sydney, Australia, [email protected]
4Victorian eResearch Strategic Initiative (VeRSI), Victoria, Australia, [email protected]
5NGS, Science and Technology Facilities Council, Daresbury Labs, UK, [email protected]
6University of Sydney, Sydney, Australia, [email protected]
Australia
(DataMINX)
United Kingdom:
Overview / Aims
• An open-source project developing a set of loosely coupled components for efficiently brokering
data copies between a wide range of (potentially incompatible) storage resources as schedulable,
fault-tolerant batch jobs (ftp, gridftp, srb, irods, sftp, file, webdav, srm?).
• To scale from small embedded deployments to large distributed deployments through an
expandable ‘worker-node pool’ controlled through message-oriented middleware (MOM, JMS).
• To maximize data access and transfer efficiency through the strategic placement and subscription
of worker-nodes at or between particular data sources/sinks.
• To be inherently asynchronous and side-step the bandwidth, concurrency and scalability concerns
for clients in networks with limited capability relative to the direct connectivity between the
source and sink.
• To address geographical-topological deployment concerns by allowing service hosting to be
either centralized (as part of a shared service), or confined to a single institution or domain.
• To adopt established design patterns and open source components, coupled with a
proposal for an open standards based messaging protocol.
• To employ a single port-type, document-centric model, with service semantics defined solely by the
message model.
DTS Features / Intentions 1
1. Encourage a common messaging model
We are engaging with OGF in the definition of an open standard describing a
bulk data copy activity with subsequent control and event messages. The aim is
to provide a key foundation in addressing the challenges of data management.
Ideally standards based; OGF engagement DMI, JSDL, also communications with
Globus, Unicore, GridSAM developers (a longer term perspective).
2. Platform independence
Includes the worker agent that manages a bulk data copy activity, the message
broker, the message channel adapters that enable the different transports and
protocols, Commons VFS.
3. Adopts well recognized Enterprise Integration Patterns
Described in Hohpe and Woolf (2003): Competing Consumers, Service Activator,
Selective Consumer, Polling Consumer, Message Driven Consumer, Transport
Channel Adapter, Header Based Router.
http://www.enterpriseintegrationpatterns.com
DTS Features / Intentions 2
4. Value in the correct framework choice – deploy out of the box features in
remoting, scaling, batching:
• Spring Batch; one of the only open source batch processing frameworks
currently available (purportedly the only?). It provides many functions that are
essential in batch processing.
• Spring Integration; supports the EAI patterns identified by Hohpe and Woolf.
Importantly it provides a set of inbound and outbound message-channel
adapters for different integration options, both polling and message driven
adapters (e.g. JMS subscription, file/directory polling, RMI, WS, email).
• Message broker (e.g. Apache ActiveMQ or any JMS 1.2 message-channel MOM
broker).
Buffering data via an intermediary when copying
between incompatible resources / protocols
[Figure: a client (e.g. Portal/Hermes) sits between SRB/FTP and SFTP/GSIFTP storage resources and moves data by Get and Put, or through an in-memory buffer (bit pipe).]
• Client provides a single interface to different (potentially incompatible) storage
resources, e.g. SRB, GsiFTP, FTP, SFTP, iRODS, file, WebDAV.
• Client brokers between storage resources when third-party transfer is not available.
• File operations (list, upload, download, delete, rename).
• Authentication tokens (un/pw, x509?).
Client-Side Intermediary
Benefits
1. Auth tokens only in memory on one computer.
2. Self contained and interactive.
3. Extensible for new and emerging resources/protocols.
Challenges
1. Software is required that is capable of enacting a data copying activity between a variety
of sources and sinks (bit pipe via byte streams or combined get/put).
2. The client must be constantly available throughout the duration of the transfer.
3. Buffering of large quantities of data introduces bandwidth and concurrency concerns for
clients residing on networks with limited capability (e.g. wireless connectivity) relative to
the direct connectivity between the source and sink.
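A minimal sketch of the "bit pipe" option mentioned in challenge 1, assuming the source and sink are exposed as byte streams. The function name `bit_pipe` and the in-memory streams are hypothetical stand-ins; a real client would obtain protocol-specific streams from an SFTP, GridFTP or similar library.

```python
import io


def bit_pipe(source, sink, chunk_size=64 * 1024):
    """Stream bytes from source to sink in fixed-size chunks.

    Only one chunk is held in memory at a time, unlike a combined
    get/put that stages the whole file on the intermediary first.
    Returns the number of bytes copied.
    """
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        sink.write(chunk)
        copied += len(chunk)
    return copied


# In-memory stand-ins for the remote source and sink (assumption).
source = io.BytesIO(b"x" * 200_000)
sink = io.BytesIO()
total = bit_pipe(source, sink)
```

Even with a small buffer, every byte still crosses the client's network link twice (once in, once out), which is the bandwidth concern raised in challenge 3.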
DTS – Remotely Placed Worker Agents
Aim: Strategically place intermediary software agent(s) (e.g. at different institutions,
within a network, at a local source/sink) and remotely invoke an appropriate agent using a
message router with a ‘Bulk Data Copy Activity’ executed as a fault tolerant batch process.
Best practice: process data as close to where it resides as possible.
3 Core DTS Components:
• Batch/Worker Agent. Software that will manage a bulk data copy activity. Is a batch
operation – automated processing of large volumes of information that is most
efficiently processed without user interaction (fire + forget).
• Common Message format that describes a data copying activity with subsequent
control and event messages, so that the recipient worker can access the data on behalf
of the user:
  • Lists data sources and sinks.
  • Transfer requirements.
  • User credentials.
• Message Broker/Router for routing of messages to appropriate workers and scaling via
the Competing Consumer pattern.
DTS Architecture (Simplified)
Broker between remote sources and sinks
[Figure: clients send a data copy activity message through a queue channel to DTS workers, which copy data from source to sink (Get/Put or bit pipe).]
• Clients: a meta-data system or data catalogue (ICAT) that provides a list of data URLs
and credentials, OR lightweight file operations directly interacting with the
source/sink (list, delete, rename).
• Authentication tokens (un/pw, myproxy details).
DTS Architecture (Simplified)
Broker between local source and remote sink (and vice-versa)
[Figure: clients in a Home Lab, at Facility / Department X and at Facility / Department Y, each with its own source/sink, connected through a facility queue.]
Message Bus is a combination of a messaging
infrastructure, a common data model and
command set to allow different systems to
communicate through a shared set of interfaces
(our message channels).
http://www.enterpriseintegrationpatterns.com
Deployment Strategies
Small – Local or embedded worker agent
Med – Single worker pool
Large – Multiple worker pools and message router
1) Lightweight local worker deployment. The worker agent is invoked by a script or is
integrated into an existing application. S = Submit message (bulk copy activity
document), C = Control message, e = Event message.
[Figure 1: a client invokes a worker node (WN) through a Service Activator; the worker copies between source and sink.]
2) Distributed deployment with a single worker pool.
[Figure 2: a worker pool copying between source and sink; diagram labels: DMQ, JobQb, ControlQ, ReplyQ.]
3) Distributed deployment with multiple worker pools.
[Figure 3: worker pools A and B behind routers, reached over HTTPS; diagram labels: JobQ, JobQa, JobQb, AJobQ, BJobQ, ControlQ, AControlQ, BControlQ, ReplyQ, DMQ.]
Core Component
Message Router / Broker
Schedule and route messages to strategically
placed worker agents.
Scale with multiple agents using competing
consumer pattern.
Scaling
How can the architecture scale for increasing loads?
• Scale Out: Competing Consumer Pattern
To scale horizontally (or scale out) means to add more nodes to a system.
• Scale Up: Multi-process Service Activator
To scale vertically (or scale up) means to add resources and/or processes to a
single node in a system.
Scale Out – Competing Consumer Pattern
• The only requirement is that the JMS client and consumer must be able to access the broker.
• This provides location independence, which enables scaling and clustering of services since
multiple workers can be configured to pull messages from the same queue.
• If the service becomes overburdened and falls behind in its processing, all that is needed is to
start up a few more worker instances to listen to the queue.
• Consumers do not have to coordinate with each other, which improves resilience, since workers
can be added and removed without affecting each other.
[Figure: JMS client (Producer) to Broker (Queue) to Worker (Consumer); queue depth ok.]
Basic architecture is repeatable – use multiple brokers and queues as required (e.g. broker clusters,
master/slave brokers etc).
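The Competing Consumer pattern described above can be sketched with a thread-safe in-process queue standing in for the JMS broker. This is an illustration under that assumption, not the DTS implementation; `run_pool`, the worker names and the job names are hypothetical.

```python
import queue
import threading


def worker(name, job_queue, results, lock):
    """Pull jobs until a None sentinel arrives; no coordination with peers."""
    while True:
        job = job_queue.get()
        if job is None:
            break
        with lock:
            results.append((name, job))


def run_pool(jobs, worker_count):
    """Start worker_count consumers on one shared queue and feed them jobs."""
    job_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}", job_queue, results, lock))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)  # one shutdown sentinel per worker
    for thread in threads:
        thread.join()
    return results


results = run_pool([f"copy-{n}" for n in range(10)], worker_count=3)
```

Each job is taken by exactly one worker, and raising `worker_count` needs no change to the producer, which is the scaling property the slide relies on.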
Message Routing
How can the appropriate remote worker(s) be invoked?
• How to invoke worker(s) that reside at the data source
and/or sink?
• How to invoke worker(s) installed at my institution or
within a specific network?
• How to target a specific worker?
1. Multiple Destinations
2. Message Selectors
3. Hybrid Approach
Message Routing: Multiple Destinations
Multiple static/administered queues can be configured on one broker in order to partition workers
into different groupings.
Main Advantages: Queue depth is directly related to load. Therefore load balancing can be
performed effectively, since queues are not polluted with messages intended for other worker
groupings. DTS should add new queues for different
groupings (e.g. project queues, separate queues for different facilities).
Main Disadvantages: Changes are required on the broker to cater for new worker groupings
(configuration of new administered queues). This does not provide a high level of decoupling
between message producer and consumer since changes are required to the broker.
[Figure: JMS clients send to Request Qa, Request Qb and Request Qc on one broker; each queue is served by one worker group: Group A (Facility A), Group B (Project B), Group C (Institution C).]
In DTS, multiple destinations are used to partition static queue consumer cluster
groups, e.g. Request Q per facility, beam-line, project, institution etc.
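A toy illustration of the multiple-destinations approach, assuming a broker with a fixed set of administered queues. The `Broker` class is hypothetical and only mirrors the queue names from the figure; it shows why queue depth maps directly to load per group, and why a new grouping needs a broker-side change.

```python
from collections import deque


class Broker:
    """Toy broker with statically administered queues, one per worker group."""

    def __init__(self, queue_names):
        self.queues = {name: deque() for name in queue_names}

    def send(self, queue_name, message):
        # Producers can only target queues an administrator has configured.
        if queue_name not in self.queues:
            raise KeyError(f"no administered queue named {queue_name!r}")
        self.queues[queue_name].append(message)

    def receive(self, queue_name):
        pending = self.queues[queue_name]
        return pending.popleft() if pending else None

    def depth(self, queue_name):
        return len(self.queues[queue_name])


broker = Broker(["RequestQa", "RequestQb", "RequestQc"])
broker.send("RequestQa", "copy job 1")
broker.send("RequestQa", "copy job 2")
broker.send("RequestQb", "copy job 3")
```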
Message Routing: Message Selectors
Message Selectors - workers can be ‘Selective Consumers’ and clients can be ‘Specifying Producers’. A message selector is
an expression based on SQL92 conditional syntax, e.g.
Facility='FacilityX' AND BeamLine='ProteinMX' AND
WorkerAccessKey='abcdefadsf_guuid'
• Filtering is performed by the broker: it delivers only those messages that match the selective consumer’s criteria.
• Importantly, workers can therefore decide which messages to process depending on their own selector statements.
• Main benefit is that this approach is extensible: it provides a higher level of decoupling between message producer
and receiver, since clients and workers can be easily added without change to the broker.
• Selectors are optional; this pattern can also be combined with the multiple destination approach to route messages as
required (hybrid approach).
Selectors can be used to perform fine-grained routing and route messages however you require, e.g.
• Route to the first available worker in a particular group that specifies a common/shared selector value, e.g. a common
‘groupID’ AND/OR ‘networkID’ AND/OR ‘facilityGroup’ AND/OR ‘domain’ AND/OR ‘GB limit’ etc. (SQL).
• Route to a specific worker using a unique and opaque client identifier/access key, e.g. GUUID (this is ok since the
broker performs filtering, so different workers don’t see each other's selectors). The specifying producer would need to
persist this value between server re-starts/different sessions.
[Figure: Specifying Producers put messages with selection values on the Request Q; Selective Consumers receive only the messages that match their selectors.]
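To make the selector idea concrete, here is a rough sketch that evaluates a selector string against one message's headers. It borrows SQLite's WHERE-clause evaluation as a stand-in for a JMS broker's selector engine, which is an assumption for illustration only: the two dialects overlap for simple equality and AND/OR expressions but are not identical. The function names and worker names are hypothetical.

```python
import sqlite3


def selector_matches(selector, headers):
    """Return True if the SQL92-style selector accepts the given headers.

    Header names become columns of a one-row in-memory table so that the
    selector can be evaluated as a WHERE clause. A selector naming a header
    the message lacks is treated as a non-match. Assumes headers is non-empty.
    """
    conn = sqlite3.connect(":memory:")
    try:
        columns = ", ".join(f'"{name}"' for name in headers)
        marks = ", ".join("?" for _ in headers)
        conn.execute(f"CREATE TABLE msg ({columns})")
        conn.execute(f"INSERT INTO msg VALUES ({marks})", list(headers.values()))
        try:
            query = f"SELECT COUNT(*) FROM msg WHERE {selector}"
            count = conn.execute(query).fetchone()[0]
        except sqlite3.OperationalError:
            return False  # unknown header or malformed selector
        return count == 1
    finally:
        conn.close()


def deliver(headers, consumers):
    """Return the first consumer whose selector matches, else None."""
    for name, selector in consumers:
        if selector_matches(selector, headers):
            return name
    return None


consumers = [
    ("workerA", "Facility='FacilityX' AND BeamLine='ProteinMX'"),
    ("workerB", "Facility='FacilityY'"),
]
chosen = deliver({"Facility": "FacilityY", "BeamLine": "ProteinMX"}, consumers)
```

Adding a new worker here means appending one `(name, selector)` pair; nothing about the queue itself changes, which mirrors the decoupling benefit claimed for selectors.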
Message Routing: Hybrid Approach
Best approach is to use a combination of the message filtering approach and the multi-destination
approach to suit your service instance requirements.
The two approaches are not mutually exclusive and can be used together, provided both patterns are
catered for in your system.
[Figure: selective consumers attached to Request Qa and Request Qb.]
Request Response
(Client Worker Conversation)
1. ReplyTo header
2. Application ID exchange with message filtering
3. Temporary queues
Request Response (Conversation)
Request message contains a Return Address that indicates where to send the reply.
1. Return Address is added to the message header.
2. Consumer does not need to know in advance where to send the reply; it reads the address from the request.
[Figure: Specifying Producers (Clients) send on the Request Channel to Selective Consumers (Workers); replies come back on Reply Channel 1 or Reply Channel 2, as named in each request.]
Variations of this pattern depending on the client's requirements:
a) Further expand the Message Filtering Approach to exchange client and worker Application IDs. Client can
also selectively consume response messages with its own client ID added to the request header.
b) Temporary queue created by the client (lasts only for the duration of the client session).
Request Response (Conversation) using Filtering
[Figure: DTS clients (NGS Portal, an application bound to facilityA; GridSAM, an application bound to facilityB; DTS Client1) converse with DTS workers (queue consumer clusters ‘facilityA’ and ‘facilityB’) through three queues: JobSubmitQ, InvokeClientQ and InvokeWorkerQ. Steps are numbered 1) to 3) in the diagram.]
• Request headers: MessageID = guuidA, WorkerGroupID = facilityA, ClientID = DTSClient1.
• Reply and follow-up headers: CorrelationID = guuidA, WorkerID = workerA, ClientID = DTSClient1.
• Workers: MDP Selective Consumer Pool on WorkerGroupID = facilityA; MDP Selective Consumer Pool
on WorkerID = workerA; MDP Producer Pool connected to InvokeClientQ.
• DTS Client1: MDP Producer Pool connected to JobSubmitQ; MDP Selective Consumer Pool
on ClientID = DTSClient1; MDP Producer Pool connected to InvokeWorkerQ.
(Exchange of client and worker Application IDs so that recipient worker and client can converse)
Request Response (Conversation) using Filtering
• Each JMS client (worker and client) has a unique instance/application ID (clientID, workerID).
1. A client sends a job request and adds its own clientID to the headers (in conjunction with
the other headers used in message selection, e.g. MessageID and WorkerGroupID).
2. Worker picks up a message and responds to an administered response queue (not a
dynamic queue) via the ReplyTo header; it returns its own WorkerID and forwards
the given ClientID in the message header.
3. Client receives messages from the response queue and filters on ClientID.
4. Client can now converse with the recipient worker since both the client and worker have
their respective IDs and can correlate messages on the original message ID using
CorrelationID.
• Using this approach only requires a limited number of administered queues: e.g. JobSubmitQ,
InvokeClientQ, InvokeWorkerQ.
• Main benefit is that this approach is extensible: it provides a higher level of decoupling
between message producer and receiver, since clients and workers can be easily added without change
to the broker.
• Can also combine this approach with multiple channels as required (hybrid approach).
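Steps 1 to 3 above can be walked through with plain dictionaries as message headers and in-process deques standing in for the administered queues. This is a simplified sketch under those assumptions; the helper functions are hypothetical, and a real deployment would use JMS selectors on the broker rather than client-side scanning.

```python
import uuid
from collections import deque

# Administered queues (names taken from the slide).
queues = {"JobSubmitQ": deque(), "InvokeClientQ": deque()}


def client_submit(client_id, worker_group_id):
    """Step 1: send a job request carrying the client's own ID."""
    message_id = str(uuid.uuid4())
    queues["JobSubmitQ"].append({
        "MessageID": message_id,
        "WorkerGroupID": worker_group_id,
        "ClientID": client_id,
        "ReplyTo": "InvokeClientQ",
    })
    return message_id


def worker_poll(worker_id, worker_group_id):
    """Step 2: take the first request for this group and reply via ReplyTo."""
    for msg in list(queues["JobSubmitQ"]):
        if msg["WorkerGroupID"] == worker_group_id:
            queues["JobSubmitQ"].remove(msg)
            queues[msg["ReplyTo"]].append({
                "CorrelationID": msg["MessageID"],
                "WorkerID": worker_id,
                "ClientID": msg["ClientID"],
            })
            return True
    return False


def client_receive(client_id):
    """Step 3: consume only replies whose ClientID header matches."""
    for msg in list(queues["InvokeClientQ"]):
        if msg["ClientID"] == client_id:
            queues["InvokeClientQ"].remove(msg)
            return msg
    return None


request_id = client_submit("DTSClient1", "facilityA")
wrong_group_took_it = worker_poll("workerB", "facilityB")  # leaves request queued
right_group_took_it = worker_poll("workerA", "facilityA")
reply = client_receive("DTSClient1")
```

After step 3 the client holds `reply["WorkerID"]` and can address follow-up control messages to that specific worker, correlating them on the original message ID.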
Core Component
Batch / Worker Agent
Enacts the Bulk Data Copy Activity as a fault
tolerant batch job for copying between
sources and sinks.
Scopes, checkpoints and restarts.
Batch / Worker Agent
• Role is to enact the data copy activity according to the activity document, report status
events and respond to control messages.
• Copy activity is a batch processing task (automated processing of large volumes of
information that is most efficiently processed without user interaction).
• DTS worker based on Spring Batch and Commons VFS (contract driven approach facilitates
different implementations e.g. scripts / shelling out to command line client).
• Spring Batch provides a framework for functions that are essential in batch processing e.g.
split/monitor/merge, logging/tracing, tx management, processing statistics, job pause and
restart, skip, retry, check-pointing.
A Spring Batch implementation deals with
breaking apart the business logic and
sharing it efficiently between parallel
processes or processors as step-jobs.
http://static.springsource.org/springbatch/index.html
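The retry, skip and check-pointing behaviour listed above can be illustrated without any framework. The sketch below is plain Python, not the Spring Batch API; `run_copy_job`, `flaky_copy` and the file names are hypothetical, and the checkpoint is a set that the caller is assumed to persist between runs.

```python
def run_copy_job(files, copy_one, checkpoint, max_retries=2):
    """Copy each file once, skipping files already recorded in checkpoint.

    Returns the list of files that still failed after all retries.
    """
    failed = []
    for name in files:
        if name in checkpoint:
            continue  # completed in an earlier run, so a restart skips it
        for _attempt in range(max_retries + 1):
            try:
                copy_one(name)
            except IOError:
                continue  # transient failure: try again
            checkpoint.add(name)  # record success before moving on
            break
        else:
            failed.append(name)  # retries exhausted: skip and report
    return failed


attempts = {}


def flaky_copy(name):
    """Stand-in copy: 'b' fails once, 'c' always fails."""
    attempts[name] = attempts.get(name, 0) + 1
    if name == "b" and attempts[name] == 1:
        raise IOError("transient failure")
    if name == "c":
        raise IOError("sink unavailable")


checkpoint = set()
failed = run_copy_job(["a", "b", "c"], flaky_copy, checkpoint)
```

Re-running the job with the same `checkpoint` retries only the file that failed, which is the restart behaviour a fault tolerant batch worker needs.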
Core Component
Message Model
Bulk Data Copy Activity Document.
Control Messages (stop, start, cancel)
Event Messages (faults, status, instance
attributes)
Message Model Requirements
Document Message
• Bulk Data Copy Activity description
• Captures all information required to connect to each source and sink URI and
subsequently enact the activity.
• Transfer requirements e.g. URI properties, file selectors (regular expression), scheduling
(batch-window), retry count, source/sink alternatives, checksums?, sequential ordering?
DAG?
• Serialized user credentials.
• Probably adopt/extend the Data End Point Reference (DEPR) construct from DMI: a
specialized form of WS-Address element which does not mandate any particular
URL/transport scheme and allows multiple <DataLocations/>.
Control Messages
• Interact with a state/lifecycle model (e.g. stop, resume, cancel)
Event Messages
• Standard fault types and status updates
Information Model
• To advertise the service capabilities / properties / supported protocols
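The slide leaves the state/lifecycle model open. As one possible shape, the sketch below treats control messages as transitions in a small state table. The state names and the permitted transitions are assumptions made for illustration, not part of any proposed standard.

```python
# (current state, control message) -> next state. All names are hypothetical.
TRANSITIONS = {
    ("pending", "start"): "running",
    ("running", "stop"): "stopped",
    ("stopped", "resume"): "running",
    ("pending", "cancel"): "cancelled",
    ("running", "cancel"): "cancelled",
    ("stopped", "cancel"): "cancelled",
}


def apply_control(state, control):
    """Return the next state, or raise ValueError for an illegal control message."""
    try:
        return TRANSITIONS[(state, control)]
    except KeyError:
        raise ValueError(f"cannot apply {control!r} in state {state!r}") from None


state = "pending"
history = [state]
for control in ["start", "stop", "resume", "cancel"]:
    state = apply_control(state, control)
    history.append(state)
```

A rejected transition is also a natural trigger for an event message carrying a standard fault type back to the client.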
Existing/In-Scope Specifications
Related Specifications
1. Job Submission Description Language (JSDL)
• An activity description language for generic compute applications.
2. OGSA Data Movement Interface (DMI)
• Low level schema for defining the transfer of bytes between a single source and sink.
3. JSDL HPC File Staging Profile (HPCFS)
• Designed to address file staging not bulk copying.
4. OGSA Basic Execution Service (BES)
• Defines a basic framework for defining and interacting with generic compute activities: JSDL
+ extensible state and information models.
• None of these fully captures our requirements (this is not a criticism of these specs; they are designed to
address their existing use-cases, which only partially overlap with the requirements for a bulk data
copy activity).
Proprietary
• Condor Stork - based on Condor Class-Ads
• gLite JDL (again based on Class-Ads)
• Not sure if Globus has/intends a similar definition in its new developments (e.g. SaaS), anyone?
JSDL Data Staging 1 and the HPC File Staging Profile
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
  <jsdl:Source>
    <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Source>
  <jsdl:Target>
    <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Target>
  <Credentials> … </Credentials>
</jsdl:DataStaging>
The example above defines both the source and target within the same <DataStaging/> element,
which is permitted in JSDL.
However, the HPC File Staging Profile (Wasson et al. 2008), which is an
extension to JSDL, limits the use of credentials to a single credential definition
within a data staging element. Often, different credentials will be required for the
source and the target.
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:FilesystemName>DL_HOME</jsdl:FilesystemName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
  <jsdl:Source>
    <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Source>
  <Credentials> … </Credentials>
</jsdl:DataStaging>
JSDL Data Staging 2
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:FilesystemName>NGS_HOME</jsdl:FilesystemName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:Target>
    <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Target>
  <Credentials> … </Credentials>
</jsdl:DataStaging>
Coupled staging elements: a source data staging element for fileA and a corresponding target
element for staging out of the same file. By specifying that the input file is deleted after the job
has executed, this example simulates the effect of a data copy from one location to another
through the staging host.
• No multiple data locations (alternative sources and sinks).
• More elements required (e.g. transfer requirements, file selectors, URI properties).
• Intended for compute and data staging, not really bulk data copying.
OGSA DMI
The OGSA Data Movement Interface (DMI) (Antonioletti et al. 2008) defines a
number of XML constructs for describing and interacting with a data transfer activity.
The data source and destination are each described separately with a Data End
Point Reference (DEPR), which is a specialized form of WS-Address element (Box
et al. 2004).
In contrast to the JSDL data staging model, a DEPR facilitates the definition of one
or more <Data/> elements within a <DataLocations/> element. This is used to
define alternative locations for the data source and/or sink. In doing this, an
implementation is then free to select between its supported protocols and retry
different source/sink combinations from the available list. This improves resilience
and the likelihood of performing a successful data transfer by matching protocols
supported by the service.
DEPR Example
<dmi:SourceDataEPR>
  <wsa:Address>http://www.ogf.org/ogsa/2007/08/addressing/none</wsa:Address>
  <wsa:Metadata>
    <dmi:DataLocations>
      <dmi:Data ProtocolUri="http://www.ogf.org/ogsadmi/2006/03/im/protocol/gridftp-v20"
                DataUrl="gsiftp://example.org/name/of/the/dir/">
        <dmi:Credentials><wsse:UsernameToken/></dmi:Credentials>
        <other stuff/>
      </dmi:Data>
      <dmi:Data ProtocolUri="urn:my-project:srm"
                DataUrl="srm://example.org/name/of/the/dir/">
        <dmi:Credentials><wsse:UsernameToken/></dmi:Credentials>
        <other stuff/>
      </dmi:Data>
    </dmi:DataLocations>
  </wsa:Metadata>
</dmi:SourceDataEPR>
Defines alternative locations for the data
source and/or sink.
<dmi:SinkDataEPR>
  . . . Similar to above but for the sink . . .
</dmi:SinkDataEPR>
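As a sketch of how a worker might use the alternative locations, the code below filters the <Data/> entries by the protocols it supports and then tries each candidate in order. The XML is a simplified, namespace-free rendering of the <DataLocations/> fragment above (an assumption to keep the example short), and `candidate_urls`, `first_reachable` and the fake connection check are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment: namespaces, credentials and other children omitted.
DATA_LOCATIONS = """
<DataLocations>
  <Data ProtocolUri="http://www.ogf.org/ogsadmi/2006/03/im/protocol/gridftp-v20"
        DataUrl="gsiftp://example.org/name/of/the/dir/"/>
  <Data ProtocolUri="urn:my-project:srm"
        DataUrl="srm://example.org/name/of/the/dir/"/>
</DataLocations>
"""

GRIDFTP = "http://www.ogf.org/ogsadmi/2006/03/im/protocol/gridftp-v20"
SRM = "urn:my-project:srm"


def candidate_urls(xml_text, supported_protocols):
    """List DataUrl values whose ProtocolUri is supported, in document order."""
    root = ET.fromstring(xml_text)
    return [
        data.get("DataUrl")
        for data in root.findall("Data")
        if data.get("ProtocolUri") in supported_protocols
    ]


def first_reachable(urls, try_connect):
    """Return the first URL for which try_connect succeeds, else None."""
    for url in urls:
        if try_connect(url):
            return url
    return None


def fake_connect(url):
    """Stand-in connectivity check: pretend the GridFTP endpoint is down."""
    return not url.startswith("gsiftp://")


chosen = first_reachable(candidate_urls(DATA_LOCATIONS, {GRIDFTP, SRM}), fake_connect)
```

Falling through to the second location when the first is unreachable is the resilience gain the DEPR construct is meant to provide.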
DMI cont..
There are some limitations:
• DMI is intended to describe only a single data transfer operation between one source
and one sink. To do several transfers, multiple invocations of a DMI service factory would
be required to create multiple DMI service instances.
• We require a single (atomic) message packet that wraps multiple transfers and can be
delivered transactionally, e.g. through a message router.
• Some of the existing constructs require extension / slight modification.
Therefore: DMI v2 strawman proposal at OGF to canvass some new extensions and to
propose a new bulk-copy doc that builds on DMI.
Bulk Data Copy Doc and JSDL Integration ?
<jsdl:JobDefinition>
  <jsdl:JobDescription>
    <jsdl:JobIdentification ... />
    <jsdl:Application>
      <!-- Option a) Embed BulkDataCopy document -->
      <other:BulkDataCopy ... />
      <!-- If Basic Profile compliance is important -->
      <jsdl-hpcpa:HPCProfileApplication>
        <jsdl-hpcpa:Executable>/usr/bin/datacopyagent.sh</jsdl-hpcpa:Executable>
        <jsdl-hpcpa:Argument>myBulkDataCopyDoc.xml</jsdl-hpcpa:Argument>
        ...
      </jsdl-hpcpa:HPCProfileApplication>
    </jsdl:Application>
    <jsdl:Resources>
      <!-- Option b) Stage-in BulkDataCopy document -->
      <jsdl:DataStaging>
        <jsdl:FileName>myBulkDataCopyDoc.xml</jsdl:FileName>
        ...
      </jsdl:DataStaging>
    </jsdl:Resources>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
Possible options for integrating the proposed <BulkDataCopy/> document within JSDL: a)
nesting within the <jsdl:Application/> element, or b) staging-in of a <BulkDataCopy/> document
as input for the named executable - why not?