Transcript Data Management
Data Management
by Cor Cornelisse
Introduction
Data-intensive applications: physics (particle accelerators) and simulated science (supercomputers). The Large Hadron Collider (LHC) at CERN produces several petabytes of raw and derived data per year, over a period of approximately 15 years.
Introduction (cont’d)
File Storage Systems
Tree Storage System; Meta Attributes; Remote File Storage; Distributed File Storage
Tree Storage System
The filesystem has one root directory. Each directory contains files and directories.
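The tree structure described on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular filesystem stores its data; the class names `File` and `Directory` and the helper `resolve_path` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class File:
    name: str
    size: int = 0


@dataclass
class Directory:
    name: str
    # Each directory holds files and further directories, keyed by name.
    entries: Dict[str, Union["Directory", File]] = field(default_factory=dict)

    def add(self, entry):
        self.entries[entry.name] = entry
        return entry


def resolve_path(root: Directory, path: str):
    """Walk from the single root down to the entry named by an absolute path."""
    node = root
    for part in [p for p in path.split("/") if p]:
        if not isinstance(node, Directory) or part not in node.entries:
            raise FileNotFoundError(path)
        node = node.entries[part]
    return node


# One root directory; everything else hangs below it.
root = Directory("")
home = root.add(Directory("home"))
home.add(File("notes.txt", size=120))
```

Calling `resolve_path(root, "/home/notes.txt")` returns the `File` entry, while a path that does not exist raises `FileNotFoundError`.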
Meta Attributes
File name; file size; file type; last modified date; last accessed date; creation date; owner; permissions; description
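Most of these attributes can be read from a local filesystem with Python's standard `os.stat`. The sketch below is illustrative only: the helper name `describe` and the dictionary keys are assumptions made for this example. Creation date and a free-text description are left out because they are not available portably through `os.stat`.

```python
import os
import stat
import tempfile
import time


def describe(path: str) -> dict:
    """Collect the common meta attributes of one file or directory."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size": st.st_size,
        "type": "directory" if stat.S_ISDIR(st.st_mode) else "file",
        "modified": time.ctime(st.st_mtime),
        "accessed": time.ctime(st.st_atime),
        "owner_uid": st.st_uid,
        "permissions": stat.filemode(st.st_mode),
    }


# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello")
    sample_path = handle.name

info = describe(sample_path)
os.remove(sample_path)
```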
Remote File Storage
Files are not stored on the local machine but on a remote machine. Common goal: transparency for users and applications. Usual implementation: a locator for file storage consisting of host and share name (Samba, NFS).
Problem:
Files cannot be moved to a different host, because the locator names the host.
Distributed File Storage
Target: keep the actual host out of the file locator. Solution: introduce realms instead of single hosts. The locator now points to a realm, with the path relative to that realm.
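A minimal sketch of realm-based lookup, assuming a simple `realm:/path` locator syntax and an in-memory realm table. The syntax, the table `REALM_HOSTS`, the host names and the function names are all hypothetical and serve only to show why the locator survives a move.

```python
from typing import Dict, Tuple

# Hypothetical realm table: realm name -> host currently serving that realm.
REALM_HOSTS: Dict[str, str] = {"climate": "host-a.example.org"}


def parse_locator(locator: str) -> Tuple[str, str]:
    """Split 'realm:/path' into (realm, path)."""
    realm, sep, path = locator.partition(":")
    if not sep or not path.startswith("/"):
        raise ValueError(f"not a realm locator: {locator!r}")
    return realm, path


def resolve_locator(locator: str) -> str:
    """Translate a realm locator into a host-specific address at lookup time."""
    realm, path = parse_locator(locator)
    return f"//{REALM_HOSTS[realm]}{path}"


before = resolve_locator("climate:/1998/jan.dat")
REALM_HOSTS["climate"] = "host-b.example.org"  # the data moved to another host
after = resolve_locator("climate:/1998/jan.dat")
```

The locator string is identical before and after the move; only the realm table changes.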
Overall problem
Scenario: a wide diversity of storage systems, each with its own protocols (which are often incompatible).
Solution
Layered client or gateway: extra layer; sophisticated; hard to keep up with all the different protocols. Common data transfer protocol: greater reliability; performance increase.
Basic Data Management Mechanisms
GridFTP; OGSA-DAI (Data Access and Integration); Metadata Catalog Service (MCS)
GridFTP
Extensions to the FTP protocol: third-party control of data transfer; parallel data transfer; striped data transfer; partial file transfer; automatic negotiation of TCP buffer/window sizes; support for reliable and restartable data transfers.
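Partial and parallel transfer can be illustrated with a local simulation. The sketch below does not use GridFTP or any Globus API: it only shows the underlying idea of requesting a byte range, splitting it across several streams, and reassembling the pieces in order. The function names are hypothetical, and an in-memory `bytes` object stands in for the remote file.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_ranges(offset: int, length: int, streams: int) -> List[Tuple[int, int]]:
    """Divide [offset, offset + length) into at most `streams` contiguous ranges."""
    if length <= 0 or streams < 1:
        return []
    chunk = -(-length // streams)  # ceiling division
    end = offset + length
    return [(start, min(start + chunk, end)) for start in range(offset, end, chunk)]


def transfer(source: bytes, offset: int, length: int, streams: int) -> bytes:
    """Fetch a partial byte range using several simulated parallel streams."""
    ranges = split_ranges(offset, length, streams)
    if not ranges:
        return b""

    def fetch(byte_range: Tuple[int, int]) -> bytes:
        start, end = byte_range
        return source[start:end]  # stands in for one data channel

    with ThreadPoolExecutor(max_workers=streams) as pool:
        # map() yields results in input order, so the parts reassemble correctly.
        parts = list(pool.map(fetch, ranges))
    return b"".join(parts)


remote_file = bytes(range(256)) * 4
received = transfer(remote_file, offset=100, length=500, streams=4)
```

Here `received` equals `remote_file[100:600]`, regardless of how many streams are used.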
Striping
GridFTP (cont’d) – Implementations - 1
Globus_ftp_control_library: separate channels, allowing parallel, striped and third-party data transfers. Control channel (authentication, creation of control and data channels, reading and writing over data channels). Multiple data channels.
GridFTP (cont’d) – Implementations - 2
Globus_ftp_client_library: complete file get and put operations; set the level of parallelism; partial file transfer operations; third-party transfers; eventually, functions to set TCP buffer sizes; support for automatic negotiation of TCP buffer/window sizes (not yet implemented).
GridFTP (cont’d) Performance
OGSA-DAI
Supports data access, insert and update: relational (MySQL, Oracle, DB2, SQL Server, Postgres); XML (Xindice, eXist); files (CSV, BinX, EMBL, OMIM, SWISSPROT, …). Supports data delivery: SOAP over HTTP; FTP, GridFTP; e-mail; inter-service. Supports data transformation: XSLT; ZIP, GZIP. Supports security: X.509 certificate-based security.
OGSA-DAI (cont’d)
Metadata Catalog Service
Logical file; logical collection; logical view; authorization; annotation; creation and transformation history; user-defined attributes
MCS (cont’d) overview
MCS (cont’d)
Replica Management
Maintain a mapping between logical names for files and collections and one or more physical locations. Important for many applications. Example: CERN HLT data: multiple petabytes of data per year; a copy of everything at CERN (Tier 0); subsets at national centers (Tier 1); smaller regional centers (Tier 2); individual researchers will have copies.
Replica Management (cont’d)
Globus toolkit: Replica catalog definition: LDAP object classes for representing logical-to-physical mappings in an LDAP catalog. Low-level replica catalog API: globus_replica_catalog library; manipulates the replica catalog: add, delete, etc.
High-level reliable replication API: globus_replica_manager library; combines calls to file transfer operations and calls to low-level API functions: create, destroy, etc.
Example Replica Catalog
  Logical Collection: CO2 measurements 1998
    Filenames: Jan 1998, Feb 1998, …
    Location: jupiter.isi.edu
      Filenames: Mar 1998, Jun 1998, Oct 1998
      Protocol: gsiftp
      UrlConstructor: gsiftp://jupiter.isi.edu/nfs/v6/climate
    Location: sprite.llnl.gov
      Filenames: Jan 1998 … Dec 1998
      Protocol: ftp
      UrlConstructor: ftp://sprite.llnl.gov/pub/pcmdi
    Logical File Parent
      Logical File: Jan 1998 (Size: 1468762)
      Logical File: Feb 1998
  Logical Collection: CO2 measurements 1999
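The logical-to-physical mapping in this example can be sketched as a small data structure. This is an illustration only and does not use the Globus replica catalog API or LDAP: the class names and the `physical_urls` method are hypothetical, and joining the UrlConstructor and the filename with a slash is an assumption made for the example. Filenames that are elided on the slide are left out, not filled in.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Location:
    host: str
    protocol: str
    url_constructor: str
    filenames: List[str]


@dataclass
class LogicalCollection:
    name: str
    locations: List[Location]

    def physical_urls(self, filename: str) -> List[str]:
        """Return one URL per location that holds a replica of the logical file."""
        return [
            f"{loc.url_constructor}/{filename}"  # assumed joining rule
            for loc in self.locations
            if filename in loc.filenames
        ]


catalog = LogicalCollection(
    name="CO2 measurements 1998",
    locations=[
        Location(
            host="jupiter.isi.edu",
            protocol="gsiftp",
            url_constructor="gsiftp://jupiter.isi.edu/nfs/v6/climate",
            filenames=["Mar 1998", "Jun 1998", "Oct 1998"],
        ),
        Location(
            host="sprite.llnl.gov",
            protocol="ftp",
            url_constructor="ftp://sprite.llnl.gov/pub/pcmdi",
            # Only the two names shown explicitly; the elided ones are omitted.
            filenames=["Jan 1998", "Dec 1998"],
        ),
    ],
)

jan_replicas = catalog.physical_urls("Jan 1998")
mar_replicas = catalog.physical_urls("Mar 1998")
```

A logical name with no registered replica simply yields an empty list, which a client would treat as "no known physical copy".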
The End