Data Management

by Cor Cornelisse

Introduction

Data-intensive applications:
  Physics (particle accelerators)
  Simulated science (supercomputers)
The Large Hadron Collider (LHC) at CERN is expected to produce several petabytes of raw and derived data per year for approximately 15 years.

Introduction (cont’d)

File Storage Systems

Tree Storage System
Meta Attributes
Remote File Storage
Distributed File Storage

Tree Storage System

Filesystem has 1 root directory
Each directory has
  Files
  Directories

Meta Attributes

File name
File size
File type
Last modified date
Last accessed date
Creation date
Owner
Permissions
Description
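As a rough illustration of the attributes listed above, the following sketch reads several of them for a local file using Python's standard `os.stat`. The helper name `describe` and the temporary sample file are assumptions made for this example. Creation date and description are left out because they are not exposed portably across filesystems.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def describe(path):
    """Return a dict with some common meta attributes of one file."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "is_directory": stat.S_ISDIR(info.st_mode),
        # Permission string in the familiar "-rw-r--r--" form.
        "permissions": stat.filemode(info.st_mode),
        "last_modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "last_accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
        # Numeric owner id; mapping it to a user name is platform specific.
        "owner_uid": info.st_uid,
    }


# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello")
    sample_path = handle.name

attributes = describe(sample_path)
os.remove(sample_path)

for key, value in attributes.items():
    print(f"{key}: {value}")
```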

Remote File Storage

Files are not stored on the local machine but on a remote machine
Common goal: transparency for users and applications
Usual implementation: locator for file storage consisting of host and share name (Samba, NFS)

Problem:

Files cannot be moved to a different host without changing their locator

Distributed File Storage

Target: keep the actual host out of the file locator
Solution: introduce realms instead of single hosts
Locator now points to a realm, with the path relative to that realm
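The realm idea can be sketched in a few lines. This is a minimal illustration, not the design of any particular distributed file system: the `Locator` and `RealmDirectory` classes, the realm name and the `example.org` hosts are all hypothetical. It shows that a locator stays valid when the realm is reassigned to another host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locator:
    """A file locator that names a realm, never a concrete host."""
    realm: str
    path: str


class RealmDirectory:
    """Hypothetical lookup service mapping each realm to its current host."""

    def __init__(self):
        self._hosts = {}

    def assign(self, realm, host):
        # Moving a realm to another host only changes this one entry.
        self._hosts[realm] = host

    def resolve(self, locator):
        host = self._hosts[locator.realm]
        return f"//{host}/{locator.path.lstrip('/')}"


directory = RealmDirectory()
directory.assign("climate", "host-a.example.org")

locator = Locator(realm="climate", path="/data/jan.dat")
before = directory.resolve(locator)

# The data moves to another machine; the locator itself is untouched.
directory.assign("climate", "host-b.example.org")
after = directory.resolve(locator)

print(before)
print(after)
```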

Overall problem

Scenario:  Wide diversity in Storage Systems  All have their own protocols (which are often incompatible)

Solution

Layered client or gateway
  Extra layer
  Sophisticated
  Hard to keep up with all the different protocols
Common data transfer protocol
  Greater reliability
  Performance increase

Basic Data Management Mechanisms

GridFTP
OGSA-DAI (Data Access and Integration)
Metadata Catalog Service (MCS)

GridFTP

Extensions to the FTP protocol:
  Third-party control of data transfer
  Parallel data transfer
  Striped data transfer
  Partial file transfer
  Automatic negotiation of TCP buffer/window sizes
  Support for reliable and restartable data transfers
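Parallel and partial transfer both rest on addressing a file by byte ranges. The sketch below only illustrates that idea and is not the GridFTP wire format or any Globus API: the functions `split_ranges` and `reassemble` are hypothetical names. A file is cut into one range per stream, and the receiver rebuilds it even when the blocks arrive out of order.

```python
def split_ranges(total_size, streams):
    """Split [0, total_size) into contiguous (offset, length) ranges,
    at most one per stream, with sizes differing by at most one byte."""
    if total_size < 0 or streams < 1:
        raise ValueError("total_size must be >= 0 and streams >= 1")
    base, extra = divmod(total_size, streams)
    ranges = []
    offset = 0
    for index in range(streams):
        length = base + (1 if index < extra else 0)
        if length == 0:
            continue  # more streams than bytes: skip empty ranges
        ranges.append((offset, length))
        offset += length
    return ranges


def reassemble(total_size, blocks):
    """Rebuild the file from (offset, data) blocks in any arrival order."""
    buffer = bytearray(total_size)
    for offset, data in blocks:
        buffer[offset:offset + len(data)] = data
    return bytes(buffer)


source = bytes(range(10))
ranges = split_ranges(len(source), 3)

# Each range would travel over its own data channel; reversing the list
# simulates blocks arriving in a different order than they were sent.
blocks = [(offset, source[offset:offset + length]) for offset, length in ranges]
blocks.reverse()
restored = reassemble(len(source), blocks)

print(ranges)
print(restored == source)
```

A restartable transfer can reuse the same bookkeeping: ranges that were already received are recorded, and only the missing ones are requested again.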

Striping

GridFTP (cont’d) – Implementations - 1

globus_ftp_control library:
  Separate channels, allowing parallel, striped and third-party data transfers
  Control channel (authentication, creation of control and data channels, reading and writing over data channels)
  Multiple data channels

GridFTP (cont’d) – Implementations - 2

globus_ftp_client library:
  Complete file get and put operations
  Set the level of parallelism
  Partial file transfer operations
  Third-party transfers
  Eventually, functions to set TCP buffer sizes
  Support for automatic negotiation of TCP buffer/window sizes (not yet implemented)

GridFTP (cont’d) Performance

OGSA-DAI

Supports data access, insert and update
  Relational: MySQL, Oracle, DB2, SQL Server, Postgres
  XML: Xindice, eXist
  Files: CSV, BinX, EMBL, OMIM, SWISSPROT,…
Supports data delivery
  SOAP over HTTP
  FTP; GridFTP
  E-mail
  Inter-service
Supports data transformation
  XSLT
  ZIP; GZIP
Supports security
  X.509 certificate based security

OGSA-DAI (cont’d)

Metadata Catalog Service

Logical file
Logical collection
Logical view
Authorization
Annotation
Creation and transformation history
User defined attributes

MCS (cont’d) overview

MCS (cont’d)

Replica Management

Maintain a mapping between logical names for files and collections and one or more physical locations
Important for many applications
  Example: CERN HLT data
    Multiple petabytes of data per year
    Copy of everything at CERN (Tier 0)
    Subsets at national centers (Tier 1)
    Smaller regional centers (Tier 2)
    Individual researchers will have copies

Replica Management (cont’d)

Globus toolkit:
  Replica catalog definition
    LDAP object classes for representing logical-to-physical mappings in an LDAP catalog
  Low-level replica catalog API
    globus_replica_catalog library
    Manipulates replica catalog: add, delete, etc.
  High-level reliable replication API
    globus_replica_manager library
    Combines calls to file transfer operations and calls to low-level API functions: create, destroy, etc.

Example Replica Catalog

Logical Collection: CO2 measurements 1998
  Filename: Jan 1998
  Filename: Feb 1998
  …
Logical Collection: CO2 measurements 1999

Location: jupiter.isi.edu
  Filename: Mar 1998
  Filename: Jun 1998
  Filename: Oct 1998
  Protocol: gsiftp
  UrlConstructor: gsiftp://jupiter.isi.edu/nfs/v6/climate
Location: sprite.llnl.gov
  Filename: Jan 1998
  …
  Filename: Dec 1998
  Protocol: ftp
  UrlConstructor: ftp://sprite.llnl.gov/pub/pcmdi

Logical File Parent
  Logical File: Jan 1998
    Size: 1468762
  Logical File: Feb 1998
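The logical-to-physical mapping in this example can be mimicked with a small in-memory structure. The sketch is plain Python, not the globus_replica_catalog API or its LDAP schema, and the `ReplicaCatalog` class with its methods is hypothetical. Only the filenames spelled out in the example are registered; the entries elided by "…" are left out. The extra "Jan 1998" replica on jupiter.isi.edu is added purely to show a file with two locations.

```python
from urllib.parse import quote


class ReplicaCatalog:
    """Hypothetical in-memory stand-in for a replica catalog."""

    def __init__(self):
        self._locations = {}

    def add_location(self, host, url_constructor, filenames=()):
        self._locations[host] = {
            "prefix": url_constructor.rstrip("/"),
            "files": set(filenames),
        }

    def add_file(self, host, filename):
        self._locations[host]["files"].add(filename)

    def delete_file(self, host, filename):
        self._locations[host]["files"].discard(filename)

    def lookup(self, filename):
        """Return the physical URLs of every replica of a logical file."""
        urls = []
        for host in sorted(self._locations):
            entry = self._locations[host]
            if filename in entry["files"]:
                # Percent-encode the name so spaces are valid in a URL.
                urls.append(f"{entry['prefix']}/{quote(filename)}")
        return urls


catalog = ReplicaCatalog()
catalog.add_location(
    "jupiter.isi.edu",
    "gsiftp://jupiter.isi.edu/nfs/v6/climate",
    ["Mar 1998", "Jun 1998", "Oct 1998"],
)
catalog.add_location(
    "sprite.llnl.gov",
    "ftp://sprite.llnl.gov/pub/pcmdi",
    ["Jan 1998", "Dec 1998"],
)

single = catalog.lookup("Mar 1998")

# Hypothetical extra replica, not part of the example catalog above.
catalog.add_file("jupiter.isi.edu", "Jan 1998")
both = catalog.lookup("Jan 1998")

print(single)
print(both)
```

A client could then pick any returned URL, for example the closest or fastest location, without the logical name ever changing.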

The End