More details on the gLite IS - Dipartimento di Matematica

Enabling Grids for E-sciencE
Architecture of
gLite Data Management System
Tony Calanducci
INFN Catania
International Summer School on Grid Computing 2006
Ischia (Naples), July 9-21, 2006
www.eu-egee.org
EGEE-II INFSO-RI-031688
Outline
• Grid Data Management Challenge
• Storage Elements and SRM
• File Catalogs and DM tools
• Metadata Services
• File Transfer Services
The Grid DM Challenge
• Heterogeneity
– Data are stored on different storage systems using different access technologies
– Need a common interface to storage resources
 → Storage Resource Manager (SRM)
• Distribution
– Data are stored in different locations; in most cases there is no shared file system or common namespace
– Data need to be moved between different locations
– Need to keep track of where data are stored
 → File and Replica Catalogs
– Need scheduled, reliable file transfer
 → File Transfer Service
Introduction
• Assumptions:
– Users and programs produce and require data
– The lowest granularity of the data is the file level (we deal with files rather than data objects or tables)
 → Data = files
• Files:
– Mostly write once, read many
– Located in Storage Elements (SEs)
– Several replicas of one file at different sites
– Accessible by Grid users and applications from “anywhere”
– Locatable by the WMS (data requirements in JDL)
• Also…
– The WMS can send (small amounts of) data to/from jobs: Input and Output Sandbox
– Files may be copied between local filesystems (WNs, UIs) and the Grid (SEs)
gLite Grid Storage Requirements
• Definition: the Storage Element (SE) is the service that allows a user or an application to store data for future retrieval
• Manages local storage (disk) and/or interfaces to complex Mass Storage Systems (disk arrays and tape libraries) such as
– HPSS, CASTOR, DiskeXtender (UNITREE), …
• Offers a single virtual file system even if it uses different storage technologies (arrays of disks and tapes), hiding the details from the users (by providing an SRM interface)
• Supports basic file transfer protocols
– GridFTP is mandatory (GSI-enabled FTP)
– Others if available (https, ftp, etc.)
• Supports a native I/O (remote file) access protocol
– POSIX(-like) I/O client library for direct access to data
SRM in an example
She is running a job which needs:
– Data for physics event reconstruction
– Simulated data
– Some data analysis files
She will write files remotely too.
The data are spread over three sites:
– They are at CERN, in dCache
– They are at Fermilab, in a disk array
– They are at Nikhef, in a classic SE
SRM in an example
[Figure: the user, the SRM and three storage systems]
– dCache: own system, own protocols and parameters
– classic SE: independent system from dCache or Castor
– Castor: no connection with dCache or classic SE
“You as a user need to know all the systems!!!”
SRM: “I talk to them on your behalf. I will even allocate space for your files. And I will use transfer protocols to send your files there.”
Storage Resource Management
• The SRM (Storage Resource Manager) is a protocol for storage resource management
– It does not do any data transfer
– It is used to ask a Mass Storage System (MSS) to make a file ready for transfer, or to create space in a disk cache to which a file can be uploaded
– The actual transfer is done using a file transfer protocol supported by the backend MSS
• Storage resource management needs to take into account
– Transparent access to files (migration to/from disk pool)
– File pinning
– Space reservation
– File status notification
– Lifetime management
• The SRM is a single interface that takes care of local storage interaction and provides a Grid interface to the outside world
– In gLite, interactions with the SRM interface are hidden by higher-level tools (DM tools and APIs)
gLite Storage Element
File naming conventions
• Logical File Name (LFN)
– An alias created by a user to refer to some item of data, e.g.
“lfn:/grid/gilda/tony/simple2.dat”
• Globally Unique Identifier (GUID)
– A non-human-readable unique identifier for an item of data, e.g.
“guid:3a69a819-2023-4400-a2a1-f581ab942044”
• Site URL (SURL)
– Indicates on which Storage Element the file is actually stored
– Understood by the SRM interface, e.g.
“srm://aliserv6.ct.infn.it/dpm/ct.infn.it/home/gilda/generated/2006-0710/filef7a916f7-159b-48df-9159-877f2d3c6f58”
• Transport URL (TURL)
– Temporary locator of a replica plus its access protocol; understood by the backend MSS, e.g.
“gsiftp://aliserv6.ct.infn.it/aliserv6.ct.infn.it:/gpfs/dpm/gilda/2006-0710/filef7a916f7-159b-48df-9159-877f2d3c6f58.46193.0”
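The four kinds of name can be told apart by their scheme prefix. The following is a minimal sketch, assuming the prefixes used in the examples on this slide (lfn:, guid:, srm://) and treating a small, non-exhaustive set of access protocols as TURL schemes. The names classify_name and TRANSPORT_SCHEMES are hypothetical and not part of any gLite API.

```python
# Illustrative only: classify a Grid file name by its scheme prefix.
# TRANSPORT_SCHEMES is an assumed, non-exhaustive set of access protocols.
TRANSPORT_SCHEMES = {"gsiftp", "rfio", "gsirfio", "dcap", "gsidcap", "file"}


def classify_name(name: str) -> str:
    """Return 'LFN', 'GUID', 'SURL' or 'TURL' for a prefixed name."""
    scheme, sep, _rest = name.partition(":")
    if not sep:
        raise ValueError(f"no scheme prefix in {name!r}")
    if scheme == "lfn":
        return "LFN"
    if scheme == "guid":
        return "GUID"
    if scheme == "srm":
        return "SURL"
    if scheme in TRANSPORT_SCHEMES:
        return "TURL"
    raise ValueError(f"unknown scheme {scheme!r}")


print(classify_name("lfn:/grid/gilda/tony/simple2.dat"))           # LFN
print(classify_name("guid:3a69a819-2023-4400-a2a1-f581ab942044"))  # GUID
```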
SRM Interactions
[Figure: Client, SRM and Storage, connected by arrows numbered 1 to 5]
1. The client asks the SRM for a file, providing an SURL (Site URL)
2. The SRM asks the storage system to provide the file
3. The storage system notifies the availability of the file and its location
4. The SRM returns a TURL (Transfer URL), i.e. the location from which the file can be accessed
5. The client interacts with the storage using the protocol specified in the TURL
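The five steps can be sketched with a toy, in-memory model. This is not the SRM protocol or any gLite client: ToyStorage, ToySRM, fetch and the host se.example.org are hypothetical, and for simplicity the TURL reuses the SURL path, which a real storage system need not do.

```python
# Toy model of the SURL -> TURL exchange; all names are hypothetical.
class ToyStorage:
    """Stands in for the backend storage system (steps 2, 3 and 5)."""

    def __init__(self, host, files):
        self.host = host
        self.files = dict(files)  # storage path -> file content

    def stage(self, path):
        # Steps 2-3: make the file available and report its location.
        if path not in self.files:
            raise FileNotFoundError(path)
        return path

    def read(self, path):
        # Step 5: direct access by the client.
        return self.files[path]


class ToySRM:
    """Translates an SURL into a TURL without moving any data itself."""

    def __init__(self, storage, protocol="gsiftp"):
        self.storage = storage
        self.protocol = protocol

    def get_turl(self, surl):
        # Step 1: the client provides an SURL.
        prefix = f"srm://{self.storage.host}"
        if not surl.startswith(prefix + "/"):
            raise ValueError(f"SURL not served by this SRM: {surl}")
        path = surl[len(prefix):]            # keeps the leading "/"
        location = self.storage.stage(path)  # steps 2-3
        # Step 4: return a TURL naming the access protocol.
        return f"{self.protocol}://{self.storage.host}{location}"


def fetch(srm, storage, surl):
    """Client side: obtain a TURL, then talk to the storage directly."""
    turl = srm.get_turl(surl)
    protocol, _, rest = turl.partition("://")
    _host, _, path = rest.partition("/")
    return protocol, storage.read("/" + path)
```

The point of the split is visible in fetch: the SRM only hands back a locator, and the bytes come from the storage through the protocol named in the TURL.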
What is a file catalog
[Figure: a gLite UI, a File Catalog and three Storage Elements (SEs)]
The LFC (LCG File Catalog)
• It keeps track of the location of copies (replicas) of Grid files
• The LFN acts as the main key in the database. It has:
– Symbolic links to it (additional LFNs)
– Unique Identifier (GUID)
– System metadata
– Information on replicas
– One field of user metadata
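The record structure listed above can be pictured as a small in-memory data structure. This is a sketch under stated assumptions, not the LFC schema or API: ToyCatalog, CatalogEntry and their methods are hypothetical, system metadata is omitted, and the single user-metadata field is modelled as a plain string.

```python
# Toy file catalog keyed by LFN; all names are hypothetical.
import uuid
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    guid: str
    replicas: list = field(default_factory=list)  # SURLs of the copies
    comment: str = ""                             # the one user-metadata field


class ToyCatalog:
    def __init__(self):
        self._entries = {}  # primary LFN -> CatalogEntry
        self._links = {}    # additional LFN (symlink) -> primary LFN

    def register(self, lfn, surl):
        """Record a replica for an LFN, creating the entry on first use."""
        entry = self._entries.get(lfn)
        if entry is None:
            entry = self._entries[lfn] = CatalogEntry(guid=str(uuid.uuid4()))
        if surl not in entry.replicas:
            entry.replicas.append(surl)
        return entry.guid

    def symlink(self, target, link):
        """Add an additional LFN pointing at an existing one."""
        if target not in self._entries:
            raise KeyError(target)
        self._links[link] = target

    def _resolve(self, lfn):
        return self._entries[self._links.get(lfn, lfn)]

    def guid_of(self, lfn):
        return self._resolve(lfn).guid

    def list_replicas(self, lfn):
        return list(self._resolve(lfn).replicas)
```

Whichever LFN is used for the lookup, primary or symbolic link, the same GUID and the same replica list come back.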
LFC commands
Summary of the LFC Catalog commands
lfc-chmod        Change access mode of the LFC file/directory
lfc-chown        Change owner and group of the LFC file/directory
lfc-delcomment   Delete the comment associated with the file/directory
lfc-getacl       Get file/directory access control lists
lfc-ln           Make a symbolic link to a file/directory
lfc-ls           List file/directory entries in a directory
lfc-mkdir        Create a directory
lfc-rename       Rename a file/directory
lfc-rm           Remove a file/directory
lfc-setacl       Set file/directory access control lists
lfc-setcomment   Add/replace a comment
LFC C API
Low level methods (many POSIX-like):
lfc_access, lfc_aborttrans, lfc_addreplica, lfc_apiinit, lfc_chclass, lfc_chdir,
lfc_chmod, lfc_chown, lfc_closedir, lfc_creat, lfc_delcomment, lfc_delete,
lfc_deleteclass, lfc_delreplica, lfc_endtrans, lfc_enterclass, lfc_errmsg, lfc_getacl,
lfc_getcomment, lfc_getcwd, lfc_getpath, lfc_lchown, lfc_listclass, lfc_listlinks,
lfc_listreplica, lfc_lstat, lfc_mkdir, lfc_modifyclass, lfc_opendir, lfc_queryclass,
lfc_readdir, lfc_readlink, lfc_rename, lfc_rewind, lfc_rmdir, lfc_selectsrvr,
lfc_setacl, lfc_setatime, lfc_setcomment, lfc_seterrbuf, lfc_setfsize, lfc_starttrans,
lfc_stat, lfc_symlink, lfc_umask, lfc_undelete, lfc_unlink, lfc_utime, send2lfc
GFAL: Grid File Access Library
Interactions with an SE require several components:
→ File catalog services to locate replicas
→ SRM interfaces
→ File access mechanism to access files from the SE on the UI/WN
GFAL does all these tasks for you:
→ Hides all these operations
→ Presents a POSIX interface for the I/O operations
→ Single shared library in threaded and unthreaded versions
libgfal.so, libgfal_pthr.so
→ Single header file
gfal_api.h
→ Users can create all the commands needed for storage management
→ It also offers an interface to SRM
Supported protocols:
→ file (local or nfs-like access)
→ dcap, gsidcap and kdcap (dCache access)
→ rfio (castor access) and gsirfio (dpm)
GFAL: File I/O API (I)
int gfal_access (const char *path, int amode);
int gfal_chmod (const char *path, mode_t mode);
int gfal_close (int fd);
int gfal_creat (const char *filename, mode_t mode);
off_t gfal_lseek (int fd, off_t offset, int whence);
int gfal_open (const char * filename, int flags, mode_t mode);
ssize_t gfal_read (int fd, void *buf, size_t size);
int gfal_rename (const char *old_name, const char *new_name);
ssize_t gfal_setfilchg (int, const void *, size_t);
int gfal_stat (const char *filename, struct stat *statbuf);
int gfal_unlink (const char *filename);
ssize_t gfal_write (int fd, const void *buf, size_t size);
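The prototypes above mirror the POSIX calls of the same names, so the usual open / lseek / read / close sequence applies. The sketch below shows that sequence with Python's os module on a local file; it does not call GFAL, and read_range is a hypothetical helper. The comments name the GFAL prototype each step corresponds to.

```python
# POSIX-style call sequence on a local file (not a GFAL call).
import os


def read_range(path, offset, size):
    """Read up to `size` bytes starting at byte `offset`."""
    fd = os.open(path, os.O_RDONLY)        # cf. gfal_open
    try:
        os.lseek(fd, offset, os.SEEK_SET)  # cf. gfal_lseek
        return os.read(fd, size)           # cf. gfal_read
    finally:
        os.close(fd)                       # cf. gfal_close
```

Closing the descriptor in a finally clause keeps the sketch safe when the seek or read fails, which matters more for remote storage than for a local disk.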
GFAL Java API
• The GFAL API is available to C/C++ programmers
• Because of the ISSGC’06 exercise requirements, we needed a Java version of it
• We wrote a wrapper around the C API using the Java Native Interface (JNI), and the Java API on top of it
• More information can be found here:
https://grid.ct.infn.it/twiki/bin/view/GILDA/APIGFAL
lcg-utils DM tools
• High level interface (CL tools and APIs) to
– Upload/download files to/from the Grid (UI,CE and WN <--->
SEs)
– Replicate data between SEs and locate the best replica available
– Interact with the file catalog
• Definition: A file is considered to be a Grid File if it is
both physically present in a SE and registered in the
File Catalog
• lcg-utils ensure the consistency between files in the
Storage Elements and entries in the File Catalog
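The Grid File definition is a two-sided condition, which the following sketch makes explicit. It is an illustration only and does not use lcg-utils: is_grid_file, find_inconsistencies and the result keys on_se_only and in_catalog_only are hypothetical names, and both sides are modelled as plain sets of SURLs.

```python
# A file is a Grid File only if it is on an SE AND registered in the catalog.
def is_grid_file(surl, se_files, catalog_replicas):
    return surl in se_files and surl in catalog_replicas


def find_inconsistencies(se_files, catalog_replicas):
    """Report SURLs that satisfy only one half of the definition."""
    se_files, catalog_replicas = set(se_files), set(catalog_replicas)
    return {
        "on_se_only": sorted(se_files - catalog_replicas),       # stored, not registered
        "in_catalog_only": sorted(catalog_replicas - se_files),  # registered, not stored
    }
```

Both kinds of mismatch are what a tool that copies and registers in one operation is meant to avoid.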
lcg-utils commands
Replica Management
lcg-cp    Copies a grid file to a local destination
lcg-cr    Copies a file to a SE and registers the file in the catalog
lcg-del   Deletes one file
lcg-rep   Replication between SEs and registration of the replica
lcg-gt    Gets the TURL for a given SURL and transfer protocol
lcg-sd    Sets file status to “Done” for a given SURL in a SRM request
File Catalog Interaction
lcg-aa    Adds an alias in LFC for a given GUID
lcg-ra    Removes an alias in LFC for a given GUID
lcg-rf    Registers in LFC a file placed in a SE
lcg-uf    Unregisters in LFC a file placed in a SE
lcg-la    Lists the aliases for a given SURL, GUID or LFN
lcg-lg    Gets the GUID for a given LFN or SURL
lcg-lr    Lists the replicas for a given GUID, SURL or LFN
LFC interfaces
[Figure: LFC interfaces. Labels: SEs, LCG UTILS, GFAL, Python, WMS, LFC client, C API, LFC server, DLI, CLI (lfc-ls, lfc-mkdir, lfc-setacl, …)]
Metadata on the Grid
• Metadata is data about data
• On the Grid: mainly information about files
– Describe files
– Locate files based on their contents
– Can also add details on running jobs
– …
• But also simplified DB access on the Grid
– Many Grid applications need structured data
– Many applications require only simple schemas
 → Can be modelled as metadata
– Main advantage: better integration with the Grid environment
 ▪ Metadata Service is a Grid component
 ▪ Grid security
 ▪ Hides DB heterogeneity
• AMGA is the Metadata Component of gLite
Example
• Suppose we have a set of movie trailers saved on several storage elements
$ lfc-ls -l /grid/gilda/trailers
-rw-rw-r--   1 101      102      10188804 Apr 14 17:21 BatmanBegins.mpg
-rw-rw-r--   1 109      102       3201028 Apr 14 19:34 alien.mpg
-rw-rw-r--   1 101      102       3545092 Apr 14 17:19 amelie.mpg
-rw-rw-r--   1 101      102       5277700 Apr 14 17:27 american2.mpg
-rw-rw-r--   1 101      102       5828612 Apr 14 17:28 fastfurious.mpg
-rw-rw-r--   1 192      102      20509586 Apr 20 14:08 insideman.avi
-rw-rw-r--   1 101      102       5912580 Apr 14 17:31 madagascar.mpg
-rw-rw-r--   1 101      102       5812228 Apr 14 17:30 matrix.mpg
-rw-rw-r--   1 192      102      12918756 Apr 20 19:09 pinkpanther.mov
-rw-rw-r--   1 101      102       6240260 Apr 14 17:30 spiderman.mpg
• We could add more details (Movie Title, Cast, Runtime, PlotOutline, Genre, Director) about their contents by associating metadata with them.
• We could then look for movies that satisfy some desired search criteria (e.g. comedies in which our favourite actor performed, or movies about animals and zoos)
Metadata Concepts
• Basic Definitions
– Entries – List of items to which we want to attach metadata
(ex: each movie will be represented as an entry in AMGA)
– Attribute – key/value pair with type information
 ▪ Name/Key – The name of the attribute
 (ex: MovieTitle, Cast, PlotOutline, Runtime, …)
 ▪ Type – The type
 (ex: varchar, int, float, text, numeric, …)
 ▪ Value – Value of an entry's attribute
 (ex: “Spider Man 2”, “Tobey Maguire, Kirsten Dunst”, 127, …)
– Metadata – List of attributes associated with entries
– Schema – A set of attributes
– Collection – A set of entries associated with a schema
– We can think of collections as DB tables, the schema as the list of fields (with their types), attributes as columns, entries as rows
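The table analogy in the last point can be tried out directly with a relational database. The sketch below uses Python's sqlite3 module rather than AMGA. The collection name trailers, the subset of attributes, the Genre values and the second row are illustrative assumptions; only the “Spider Man 2” title and runtime come from the slide.

```python
# Collection = table, schema = column list, attribute = column, entry = row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trailers ("       # the collection
    " entry TEXT PRIMARY KEY,"      # entry name (here: the file name)
    " MovieTitle TEXT,"             # attributes with their types
    " Runtime INTEGER,"
    " Genre TEXT)"
)
conn.executemany(
    "INSERT INTO trailers VALUES (?, ?, ?, ?)",
    [
        ("spiderman.mpg", "Spider Man 2", 127, "Action"),   # Genre is illustrative
        ("madagascar.mpg", "Madagascar", 86, "Animation"),  # illustrative values
    ],
)


def find_entries(conn, genre, max_runtime):
    """Return the entries whose attributes satisfy the search criteria."""
    rows = conn.execute(
        "SELECT entry FROM trailers"
        " WHERE Genre = ? AND Runtime <= ? ORDER BY entry",
        (genre, max_runtime),
    )
    return [row[0] for row in rows]


print(find_entries(conn, "Animation", 120))  # ['madagascar.mpg']
```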
AMGA Features
• Dynamic Schemas
– Schemas can be modified at runtime by the client
 ▪ Create, delete schemas
 ▪ Add, remove attributes
• Metadata organised as a hierarchy
– Collections can contain sub-collections
– Analogy to a file system:
 ▪ Collection → Directory; Entry → File
• Flexible Queries
– SQL-like query language
– Joins between schemas
– Example
selectattr /gLibrary:FileName /gLAudio:Author /gLAudio:Album
'/gLibrary:FILE=/gLAudio:FILE and like(/gLibrary:FileName, "%.mp3")'
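For readers more used to SQL, the query above can be read as a join between two collections. The following is a relational analogue in sqlite3, not AMGA syntax. It assumes the condition behaves like an inner join on FILE, and every row of data is made up for illustration.

```python
# SQL analogue of the selectattr example; data rows are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gLibrary (FILE TEXT PRIMARY KEY, FileName TEXT)")
conn.execute("CREATE TABLE gLAudio (FILE TEXT PRIMARY KEY, Author TEXT, Album TEXT)")
conn.executemany(
    "INSERT INTO gLibrary VALUES (?, ?)",
    [("f1", "track01.mp3"), ("f2", "notes.pdf"), ("f3", "talk.ogg")],
)
conn.executemany(
    "INSERT INTO gLAudio VALUES (?, ?, ?)",
    [("f1", "Example Artist", "Example Album"),
     ("f3", "Another Artist", "Another Album")],
)

# Join the two collections on FILE and keep only the .mp3 file names.
rows = conn.execute(
    "SELECT l.FileName, a.Author, a.Album"
    " FROM gLibrary AS l JOIN gLAudio AS a ON l.FILE = a.FILE"
    " WHERE l.FileName LIKE '%.mp3'"
).fetchall()
print(rows)  # [('track01.mp3', 'Example Artist', 'Example Album')]
```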
Security
• Unix-style permissions
• ACLs – per-collection or per-entry
• Secure connections – SSL
• Client authentication based on
– Username/password
– General X509 certificates
– Grid-proxy and VOMS-proxy certificates
• Access control via a Virtual Organization Management System (VOMS):
[Figure: VOMS-based access control. Labels: VOMS, “Authenticate with X509 Cert”, “VOMS-Cert with Group & Role information”, VOMS-Cert, “Resource management”, AMGA, Oracle]
AMGA Implementation
• C++ multiprocess server
– Runs on any Linux flavour
• Backends
– Oracle, MySQL, PostgreSQL, SQLite
• Two frontends
– TCP Streaming
 ▪ High performance
 ▪ Client API for C++, Java, Python, Perl, Ruby
– SOAP
 ▪ Interoperability
• Also implemented as a standalone Python library
– Data stored on the filesystem
[Figure: Metadata Server architecture. Labels: Client, SOAP, TCP Streaming, MD Server, Oracle, PostgreSQL, MySQL, SQLite; Client, Python Interpreter, Metadata Python API, filesystem]
GILDA Use Cases
gLibrary Use Case
• Attempts to create a Multimedia Management System on the Grid
– Examples of multimedia content handled by gLibrary:
 ▪ Images
 ▪ Movies
 ▪ Audio files
 ▪ Office documents (PowerPoint, Word, Excel, OpenOffice)
 ▪ E-mails, PDFs, HTML pages
 ▪ Customized versions of well-known document types (ex. EGEE PPTs)
 ▪ …
• Keeps track of, and organizes in a uniform way, all the additional details (metadata) of files saved in Storage Elements and registered in File Catalogues
• Provides users with an easy way to locate and retrieve files based on their contents
gLibrary JAVA GUI screenshot
gLibrary Deployment scenario
[Figure: gLibrary deployment scenario. Labels: “Authenticate with X509 Certificate”; “VOMS Proxy with Group & Role Information (gLibraryManager, gLibrarySubmitter, VO user)”; UI; File Catalog; three SEs]
gMOD: grid Movie On Demand
• gMOD provides a Video-On-Demand service
• The user chooses from a list of videos, and the chosen one is streamed in real time to the video client on the user’s workstation
• For each movie many details (Title, Runtime, Country, Release Date, Genre, Director, Cast, Plot Outline) are stored, and users can search for a particular movie by querying on one or more attributes
• Two kinds of users can interact with gMOD: TrailersManagers, who can administer the database of movies (uploading new ones and attaching metadata to them), and GILDA VO users (guests), who can browse, search and choose a movie to be streamed
gMOD interactions
[Figure: gMOD interactions. Labels: User, GENIUS Portal, VOMS (“get Role”), AMGA Metadata Catalogue, LFC Catalogue, Storage Elements, Workload Management System, CE]
gMOD screenshot
gMOD is accessible through the GENIUS Portal (https://glite-tutor.ct.infn.it)
Data movement introduction
• Grids are naturally distributed systems
• This means that data also need to be distributed
– First-generation data distribution mainly concentrated on copy protocols in a grid environment:
 ▪ gridftp
 ▪ http + mod_gridsite
 ▪ File movement started and controlled on the client side
• But copies controlled by clients have problems…
Direct Client Controlled Data Movement
[Figure: direct client-controlled transfer. Labels: Client, Control Channels, Source Storage Element, Data Flow Channel, Destination Storage Element]
• Although the transport protocol may be robust, state is held inside the client, which is inconvenient and fragile.
• The client only knows about local state; it has no global knowledge of data transfers between storage elements.
– Storage elements can be overwhelmed with replication requests
– Multiple replications of the same data can happen simultaneously
– The site has little control over the balance of network resources (risk of DoS)
Transfer Service
• Clear need for a service for data transfer
– Client connects to the service to submit a request
– Service maintains state about the transfer
– Client can periodically reconnect to check status or cancel the request
– Service can have knowledge of global state, not just a single request
 ▪ Load balancing
 ▪ Scheduling
[Figure: transfer service. Labels: Client, “SOAP via https” (submit new request, monitor progress, cancel request), Transfer Service, Control, Source Storage Element, Data Flow, Destination Storage Element]
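The difference from client-controlled copies is that the state lives in the service, so a client can disconnect and come back later. The toy model below illustrates that idea only. ToyTransferService, its methods and the state names (Submitted, Active, Done, Canceled) are assumptions made for this sketch, not the FTS interface or its real state machine, and "transfers" complete instantly.

```python
# Toy stateful transfer service; names and states are hypothetical.
import itertools
from collections import Counter


class ToyTransferService:
    CANCELLABLE = ("Submitted", "Active")

    def __init__(self):
        self._jobs = {}                 # job id -> job record (service-side state)
        self._ids = itertools.count(1)

    def submit(self, source, destination):
        """Client call: queue a transfer and get back an id to poll with."""
        job_id = next(self._ids)
        self._jobs[job_id] = {
            "source": source,
            "destination": destination,
            "state": "Submitted",
        }
        return job_id

    def status(self, job_id):
        """Client call: check progress after reconnecting."""
        return self._jobs[job_id]["state"]

    def cancel(self, job_id):
        """Client call: cancel if the job has not finished yet."""
        job = self._jobs[job_id]
        if job["state"] in self.CANCELLABLE:
            job["state"] = "Canceled"
        return job["state"]

    def complete_next(self):
        """Service side: process the oldest queued job (instantly, in this toy)."""
        for job_id, job in self._jobs.items():
            if job["state"] == "Submitted":
                job["state"] = "Done"
                return job_id
        return None

    def queued_per_destination(self):
        """Global view across all requests, usable for load balancing."""
        return Counter(
            job["destination"]
            for job in self._jobs.values()
            if job["state"] == "Submitted"
        )
```

The last method is the part a single client cannot provide: it counts queued work per destination across every request the service has seen.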
gLite FTS: Channels
• The FTS has a concept of channels
• A channel is a unidirectional connection between two sites
• Transfer requests between these two sites are assigned to that channel
• Channels usually correspond to a dedicated network pipe (e.g. OPN) associated with production
• But channels can also take wildcards:
– * to MY_SITE : All incoming
– MY_SITE to * : All outgoing
– * to * : Catch all
• Channels control certain transfer properties: transfer concurrency, gridftp streams.
• Channels can be controlled independently: started, stopped, drained.
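Assigning a request to a channel with wildcards amounts to a most-specific-match lookup. The sketch below is one way to do it and is not taken from FTS: pick_channel is a hypothetical function, channels are modelled as (source, destination) pairs, and the precedence between “* to MY_SITE” and “MY_SITE to *” is an assumption.

```python
# Pick the most specific channel for a (source, destination) pair.
def pick_channel(source, destination, channels):
    """Return the matching channel, or None if no channel applies."""
    candidates = (
        (source, destination),  # dedicated channel between the two sites
        ("*", destination),     # all incoming (assumed to win over outgoing)
        (source, "*"),          # all outgoing
        ("*", "*"),             # catch all
    )
    for candidate in candidates:
        if candidate in channels:
            return candidate
    return None


channels = {("CERN", "MY_SITE"), ("*", "MY_SITE"), ("MY_SITE", "*"), ("*", "*")}
print(pick_channel("OTHER_SITE", "MY_SITE", channels))  # ('*', 'MY_SITE')
```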
Data Management Services Summary
• Storage Elements – save data and provide a common interface
– Storage Resource Manager (SRM): Castor, dCache, DPM, …
– Native access protocols: rfio, dcap, nfs, …
– Transfer protocols: gsiftp, ftp, …
• Catalogs – keep track of where data are stored
– File Catalog and Replica Catalog: LCG File Catalog (LFC)
– Metadata Catalog: AMGA Metadata Catalogue
• Data Movement – schedules reliable file transfers
– File Transfer Service: gLite FTS (manages physical transfers)
References
• gLite documentation homepage
– http://glite.web.cern.ch/glite/documentation/default.asp
• DM subsystem documentation
– http://egee-jra1-dm.web.cern.ch/egee-jra1-dm/doc.htm
• LFC and DPM documentation
– https://uimon.cern.ch/twiki/bin/view/LCG/DataManagementDocumentation
• AMGA Project Homepage
– http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/
• FTS user guide
– https://edms.cern.ch/file/591792/1/EGEE-TECH-591792Transfer-CLI-v1.0.pdf
Questions…