BaBar Data Distribution using the Storage Resource Broker
Adil Hasan, Wilko Kroeger (SLAC Computing Services),
Dominique Boutigny (LAPP),
Cristina Bulfon (INFN, Rome),
Jean-Yves Nief (ccin2p3),
Liliana Martin (Paris VI et VII),
Andreas Petzold (TUD),
Jim Cochran (ISU)
(on behalf of the BaBar Computing Group)
IX International Workshop on Advanced Computing and
Analysis Techniques in Physics Research
KEK, Japan
1-5 December 2003
BaBar – the parameters (computing-wise)
• ~80 institutions in Europe and North America.
• 5 Tier A computing centers:
– SLAC (USA), ccin2p3 (France), RAL (UK), GridKA (Germany), Padova (Italy).
• Processing of data done in Padova.
• Bulk of simulation production done by remote institutions.
• BaBar computing is highly distributed.
• Reliable data distribution is essential to BaBar.
The SRB
• The Storage Resource Broker (SRB) is developed by the San Diego Supercomputer Center (SDSC).
• A client-server middleware for connecting heterogeneous data resources.
• Provides a uniform method of accessing the resources.
• Provides a relational database backend to record file metadata (a metadata catalog called MCAT) and access control lists (ACLs).
• Can use Grid Security Infrastructure (GSI) authentication.
• Also provides audit information.
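A minimal conceptual sketch of what an MCAT-style catalog holds, assuming hypothetical class and field names (this is not the SRB schema or API): one logical name maps to physical replicas, free-form metadata, and an ACL, which is what lets clients address data uniformly regardless of where it lives.

```python
# Conceptual sketch of an MCAT-style catalog entry; class and field
# names are illustrative, not the real SRB schema.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    logical_name: str                              # site-independent object name
    replicas: list = field(default_factory=list)   # (resource, physical path) pairs
    metadata: dict = field(default_factory=dict)   # e.g. file UUID, BaBar collection
    acl: dict = field(default_factory=dict)        # SRB username -> permission

entry = CatalogEntry("/babar/run3/events-000123")
entry.replicas.append(("slac-hpss", "/hpss/babar/events-000123"))
entry.metadata["babar_collection"] = "AllEvents-Run3"
entry.acl["ahasan"] = "read"

# A client resolves the logical name and reads from any replica:
resource, path = entry.replicas[0]
print(f"read {entry.logical_name} from {resource}:{path}")
```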
The SRB
• SRB v3:
– Defines an SRB zone comprising one MCAT and one or more SRB servers.
– Provides applications to federate zones (synchronize MCATs, and create users and data belonging to different zones); see the sketch after this list.
– Within one federation all SRB servers need to run on the same port.
– Allows an SRB server at one site to belong to more than one zone.
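A toy model of the zone and federation idea, with made-up names, an illustrative port number, and a made-up synchronization routine (the real federation tools are far richer): each zone pairs one MCAT with its servers, and federating amounts to exchanging user entries between the two MCATs.

```python
# Toy model of SRB v3 zones and federation; not the real protocol.
class Zone:
    def __init__(self, name, port=5544):   # one port shared across the federation
        self.name = name
        self.port = port
        self.mcat_users = set()            # user entries held in this zone's MCAT

def federate(zone_a, zone_b):
    """Synchronize user entries so each MCAT knows the other zone's users."""
    assert zone_a.port == zone_b.port, "all servers in a federation share one port"
    a_users, b_users = set(zone_a.mcat_users), set(zone_b.mcat_users)
    zone_a.mcat_users |= {f"{u}@{zone_b.name}" for u in b_users}
    zone_b.mcat_users |= {f"{u}@{zone_a.name}" for u in a_users}

slac, ccin2p3 = Zone("slac"), Zone("ccin2p3")
slac.mcat_users.add("wkroeger")
ccin2p3.mcat_users.add("jynief")
federate(slac, ccin2p3)
print(slac.mcat_users)   # {'wkroeger', 'jynief@ccin2p3'} (set order may vary)
```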
The SRB in BaBar
• The SRB feature-set makes it a useful tool for data distribution.
• The Particle Physics Data Grid (PPDG) effort initiated the interest in the SRB.
• A joint PPDG and BaBar collaboration effort has gone into testing and deploying the SRB in BaBar.
The SRB in BaBar
• The BaBar system has two MCATs: one at SLAC and one at ccin2p3.
• SRB v3 is used to create and federate the two zones: the SLAC zone and the ccin2p3 zone.
• Advantage: a client can connect to either SLAC or ccin2p3 and see the files at the other site.
The SRB in BaBar
[Diagram: the SLAC zone and the ccin2p3 zone, each with an MCAT-enabled SRB server, additional SRB servers, and SRB clients. Data is copied from SLAC to ccin2p3 and vice versa, and data is replicated from/copied to each zone.]
Data Distribution using SRB
• BaBar data distribution with SRB consists of the following steps (sketched after this list):
– Publish files available for distribution in the MCAT (publication I).
– Locate the files to distribute (location).
– Distribute the files (distribution).
– Publish the distributed files (publication II).
• Each of these steps requires the user to belong to some ACL (authorization).
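Read as a pipeline, the steps compose as below; every function and data structure here is a hypothetical stand-in, not a real BaBar tool or SRB command.

```python
# Hypothetical stand-ins for the distribution steps; none of these are
# real BaBar or SRB commands.
def locate(mcat, babar_collection):
    """Location: query the MCAT for objects carrying this metadata."""
    return [obj for obj, meta in mcat.items()
            if meta.get("babar_collection") == babar_collection]

def distribute_collection(mcat, acls, user, babar_collection, target):
    # authorization: the user must belong to the relevant ACL
    if user not in acls.get(babar_collection, set()):
        raise PermissionError(f"{user} not authorized for {babar_collection}")
    files = locate(mcat, babar_collection)          # location
    for f in files:                                 # distribution
        print(f"copy {f} -> {target}")
    return {f: {"site": target} for f in files}     # publication II (registration)

mcat = {"events-run10042": {"babar_collection": "AllEvents-Run3"}}  # publication I
acls = {"AllEvents-Run3": {"ahasan"}}
distribute_collection(mcat, acls, "ahasan", "AllEvents-Run3", "ccin2p3")
```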
Authorization
• BaBarGrid currently uses a European Data Grid Virtual Organization (VO).
– Consists of a Lightweight Directory Access Protocol (LDAP) database holding certificate Distinguished Name (DN) strings for all BaBar members.
– Used to update Globus grid-mapfiles.
• SRB authentication is akin to the grid-mapfile:
– Maps an SRB username to a DN string.
– The SRB username doesn't have to map to a UNIX username.
• Developing an application to obtain user DN strings from the VO (see the sketch after this list):
– The app is experiment neutral.
– Has the ability to include information from the Virtual Organization Management System.
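A sketch of what such an application might do, using the ldap3 Python library; the host name, search base, and attribute layout are assumptions for illustration, and the real EDG VO schema may differ.

```python
# Sketch: pull member DNs from a VO LDAP server and emit SRB-style
# username -> certificate-DN mappings. Host, base DN, and attribute
# names are hypothetical.
from ldap3 import Server, Connection, ALL

server = Server("grid-vo.example.org", get_info=ALL)   # hypothetical VO host
conn = Connection(server, auto_bind=True)              # anonymous bind, for illustration
conn.search("ou=babar,o=vo", "(objectClass=person)",
            attributes=["cn", "description"])

for entry in conn.entries:
    srb_user = str(entry.cn)             # SRB username need not match a UNIX login
    cert_dn = str(entry.description)     # certificate DN assumed stored per member
    print(f'"{cert_dn}" {srb_user}')     # grid-mapfile-like mapping line
```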
Publication I
• The initial publication step (event store files) entails the following (sketched after this list):
– Publication of files into the SRB MCAT once the files have been produced and published in the BaBar bookkeeping.
– Files are grouped into collections based on run range, release, and production type (an SRB collection != a BaBar collection).
– Extra metadata (such as the file UUID and the BaBar collection name) is stored in the MCAT.
– The SRB object name contains the processing spec, etc. that uniquely identify the object.
– ~5K event files (or SRB objects) per SRB collection.
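A minimal sketch of this registration step, assuming an invented collection-naming convention and invented metadata keys (the actual BaBar conventions are not spelled out here):

```python
# Sketch of publication I for event store files; naming and metadata
# keys are illustrative, not the exact BaBar conventions.
def srb_collection(run_range, release, prod_type):
    # group files by run range, release, and production type
    return f"/babar/{prod_type}/{release}/runs-{run_range[0]}-{run_range[1]}"

def register(mcat, collection, obj_name, uuid, babar_collection):
    mcat.setdefault(collection, {})[obj_name] = {
        "uuid": uuid,                          # extra metadata kept in the MCAT
        "babar_collection": babar_collection,  # SRB collection != BaBar collection
    }

mcat = {}
register(mcat, srb_collection((10000, 15000), "14.x", "prod"),
         "events-run10042-14.x-prod",          # name uniquely identifies the object
         "uuid-1234", "AllEvents-Run3")
```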
Publication I
• Detector conditions files are more complicated, as the files are constantly updated (i.e. not closed).
• As files are updated in the SRB, users need to be prevented from taking an inconsistent copy.
• Unfortunately the SRB does not currently permit locking of collections.
Publication I
• Have devised a workaround (sketched after this list):
– Register conditions file objects under a date-specified collection.
– Register a 'locator file' object containing the name of the date-specified conditions collection.
– New conditions files are then registered under a new date-specified collection.
– The 'locator file' contents are updated with the new date-specified collection name.
• This method prevents users from taking an inconsistent set of files.
• Only two sets are kept at any one time.
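A sketch of the locator-file workaround with invented path names, treating the MCAT as a plain dictionary; readers always follow the locator, so they never see a half-updated collection.

```python
# Sketch of the locator-file workaround; path names are illustrative.
import datetime

def publish_conditions(mcat, files):
    # 1. register the new files under a fresh date-specified collection
    coll = "/babar/conditions/" + datetime.date.today().isoformat()
    mcat[coll] = list(files)
    # 2. repoint the locator object at the new collection last, so a
    #    reader either sees the old complete set or the new complete set
    old = mcat.get("/babar/conditions/locator")
    mcat["/babar/conditions/locator"] = coll
    # 3. keep only two sets: the current collection and its predecessor
    stale = [c for c in mcat if c.startswith("/babar/conditions/2")
             and c not in (coll, old)]
    for c in stale:
        del mcat[c]

mcat = {}
publish_conditions(mcat, ["cond-a.db", "cond-b.db"])
print(mcat[mcat["/babar/conditions/locator"]])   # the consistent current set
```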
Location & Distribution
• Location and distribution happen in one client application.
• The user supplies a BaBar collection name from the BaBar bookkeeping.
• The SRB searches the MCAT for files that have that collection name as metadata.
• The files are then copied to the target site.
• The SRB allows a simple checksum to be performed (see the sketch after this list).
– But the checksum is not md5 or cksum.
– It can still be useful.
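The slide notes the SRB checksum is a simple one rather than md5 or cksum; the sketch below contrasts an additive checksum of that general flavour (the exact SRB algorithm is not specified here) with an md5 digest computed outside the SRB.

```python
# An additive checksum in the spirit of the SRB's simple checksum
# (the exact SRB algorithm may differ), contrasted with md5.
import hashlib

def simple_checksum(data: bytes) -> int:
    # catches truncated or corrupted copies, though far weaker than md5
    return sum(data) & 0xFFFFFFFF

data = b"event store payload"
print(simple_checksum(data))             # quick consistency check, SRB-style
print(hashlib.md5(data).hexdigest())     # stronger check done outside the SRB
```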
Location & Distribution
• The SRB allows 3rd-party replication.
– But most likely we will always run the distribution command from the source or the target site.
• Also have the ability to create a logical resource out of more than one physical resource (sketched after this list).
– Can replicate to all resources with one command.
– Useful if more than one site regularly needs the data.
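A toy illustration of the logical-resource idea with invented resource names: one replicate call fans out to every physical resource behind the logical name.

```python
# Toy model of a logical resource; resource names are illustrative.
logical_resources = {"babar-tierA": ["slac-disk", "ccin2p3-hpss", "ral-disk"]}

def replicate(obj, resource, copy):
    # one command replicates to every physical resource behind the name
    for phys in logical_resources.get(resource, [resource]):
        copy(obj, phys)

replicate("events-run10042", "babar-tierA",
          lambda o, r: print(f"copy {o} -> {r}"))
```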
Publication II
• Optionally, the copied file can be registered in the MCAT (the decision is a matter of policy).
• Extra step for data distribution to ccin2p3:
– Publication of the files in the ccin2p3 MCAT.
– Required since the current SRB v3 does not allow replication across zones.
• The extra step is not a problem, since the data need to be integrity-checked before publishing anyway (see the sketch after this list).
• Important note: data can be published and accessed at ccin2p3 or SLAC, since the MCATs will be synchronized regularly.
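A minimal sketch of that check-then-publish step, with hypothetical function names and the MCAT again modeled as a dictionary:

```python
# Sketch of publication II: verify the copy before registering it in
# the target zone's MCAT (names are illustrative).
def publish_copy(target_mcat, obj, data, expected_sum, checksum):
    if checksum(data) != expected_sum:        # integrity check comes first
        raise IOError(f"checksum mismatch for {obj}, not publishing")
    # separate registration step, since SRB v3 cannot replicate across zones
    target_mcat[obj] = {"verified": True}

mcat_ccin2p3 = {}
data = b"payload"
publish_copy(mcat_ccin2p3, "events-run10042", data,
             sum(data) & 0xFFFFFFFF, lambda d: sum(d) & 0xFFFFFFFF)
print(mcat_ccin2p3)
```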
SC2003 demonstration
• Demonstrated distribution of detector conditions files, using the scheme previously described, to 5 sites:
– SLAC, ccin2p3, Rome, Bristol, Iowa State.
• Data were distributed over 2 servers at SLAC, and files were copied in a round-robin manner from each server (sketched after this list).
• Files were continuously copied and deleted at the target site.
• The demonstration ran a full week continuously without problems.
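The round-robin copy-and-delete loop might look like the sketch below; the server names are invented, and the loop is bounded here where the real exercise ran continuously for a week.

```python
# Sketch of the SC2003 exercise: alternate between two source servers,
# copy to the target, then delete. Server names are hypothetical.
import itertools
import time

servers = itertools.cycle(["srb1.example.org", "srb2.example.org"])

def run_demo(files, copy, delete, cycles=3, pause=1.0):
    for _ in range(cycles):              # the real demo looped for a full week
        for f in files:
            src = next(servers)          # round-robin over the source servers
            copy(src, f)                 # copy the file to the target site
            delete(f)                    # then delete it at the target
        time.sleep(pause)

run_demo(["cond-a.db", "cond-b.db"],
         copy=lambda s, f: print(f"copy {f} from {s}"),
         delete=lambda f: print(f"delete {f}"))
```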
SC2003 Demonstration
[Diagram: a user is authenticated against the MCAT and the data are located; requests to transfer data go to the SRB servers, which transfer the data to the target sites (Rome, ccin2p3, etc.).]
Future work
• The system is currently being used, but is not yet considered full production quality. Missing items:
– An SRB log file parser to automatically catch errors.
– An SRB server load monitor (CPU, memory).
– Automatic generation of the SRB .MdasEnv and .MdasAuth files.
– Automatic creation of new and deletion of old users in the MCAT.
– Better packaging of the client and server apps.
– MCAT integrity-checking scripts.
– Better integration with the BaBar bookkeeping.
Future work
• Slightly longer term:
– Inclusion of a management system to manage SRB requests.
• If the system is heavily used, it will need to queue requests.
• Currently looking at Stork to manage multiple requests (http://www.cs.wisc.edu/condor/stork/).
– Interoperation with the Replica Location Service (RLS) and the Storage Resource Manager (SRM) (see Simon Metson's talk).
• Allows integration with LCG tools.
– Move to grid services.
Summary
• Extensive testing and interaction with the SRB developers has allowed BaBar to develop a data distribution system based on existing grid middleware.
• Used for distributing conditions files to 5 sites since October.
• Will be used to distribute:
– Detector conditions files.
– Event store files.
– Random trigger files (used for simulation).
• BaBar's data distribution system is sufficiently modular that it can be adapted to other environments.