iRODS workshop, September 26-28 2012, Linköping (Sweden)

iRODS experience in DEISA
Agnès Ansari – [email protected]

The iRODS usability for the DEISA infrastructure activity
• Investigate the current data management and access needs within DEISA
• Find use cases that are representative of the data management trends
• Investigate iRODS customization capabilities
• Customize iRODS as needed for the use cases
• Evaluate how well iRODS matches these requirements

The data management and access questionnaire
• The aim of this questionnaire was to evaluate the applicability of iRODS to DEISA (Nov 2009)
• It describes and improves the knowledge of users' data management, access and organization practices
• We received completed forms from end-users from IDRIS, HLRS, FZJ and LRZ WP7 teams
• It can be useful for DECI projects, virtual organizations or any HPC project

Topic | Description

Research domain and organization
primary area of research and project organization | Describe the scientific domain and the team/organization size
computation organization | Specify where the data is and where the computations run (single, main, distributed sites)

Data processing types and scenarios
data access control and security | Specify if the data is sensitive or not and if encryption is needed
kind of computation | Specify if simulations or data analysis/processing are computationally or I/O intensive
overall data volume for the project | Specify the range: from less than 10 GB up to more than 100 TB
I/O volume for a typical processing job | Specify the range: from 100 MB up to 1 TB
typical I/O schemes | Specify: write type jobs / read type jobs / read-write type jobs / read-update type jobs

Data management
data access type | Describe the use of a database, a POSIX interface, an I/O software layer (HDF5), parallel I/O file access, fraction of the file read in
multiple reads scenarios | Describe multiple readings of input files, fraction of the files read in, random access or random readings of multiple input files
file creations and data writes | Describe if temporary or permanent storage is used during job execution
data storage | Specify if the data is stored at one site, spread over multiple sites (from/to), or on a workstation
files management | Describe the file organization in directories/sub-directories or the use of metadata
data transfer tools | Describe the tools used to transfer data

Data management questionnaire feedback (1)
• Filled in by DECI projects from different research domains: engineering, particle physics, ...
• Project organization: rather small groups (<5 persons) and some collaborations (<20 persons)
• Computation organization: data and computation are distributed over several sites, so data has to be exchanged between sites
• In most cases, data is considered non-sensitive, so usually all group members can access the data and data encryption is not needed
• Protocols used: scp, rsync, gridftp

Data management questionnaire feedback (2)
• Computationally intensive simulations are performed rather than I/O intensive simulations or data analysis/processing: the I/O rate is < 1 MB of disk I/O per second of CPU time (low I/O)
• Data volume per project: 1 to 100 TB
• I/O volume per job: 50 to 500 GB
• I/O scheme: read input files/data, write output files/data (read-write type jobs) rather than read-update type jobs
• Files are organized in directories/sub-directories. File identification is based on file names (using specific naming conventions)
• Data access: direct access to data or files, or parallel I/O
• File access: complete file reads or multiple reads. Random access or random reads are seldom used
• Data storage: at one computing site or spread over several (GPFS is not the only data access method)

Use case for using iRODS in DEISA
• A collaboration of 2 DECI projects running jobs over 2 DEISA sites
– Management of distributed and shared data
– Heterogeneous computing environment
– Heterogeneous set of data storage spaces
– Various user profiles (DEISA and non-DEISA accounts) that have to collaborate by sharing data

[Diagram: storage layout at DEISA Site_a and DEISA Site_b. Legend: distributed storage, permanent storage, archive storage. Labels: permanent storage and archive storage (HOME/WORKDIR) for local users and DEISA users; GPFS; distributed storage on a specific node (DEISA_HOME/DEISA_DATA) for DEISA users; distributed and shared storage for DEISA users.]

[Diagram: data flow between DEISA site_a and DEISA Site_b over the DEISA network. Labels: $DEISA_DATA/CR (distributed/shared storage, available on login nodes only); GPFS; permanent storage $WORKDIR and $WORK (job computing environment); data copy (I/O); input data/results; archive storage for localuser_site_a and localuser_site_b; storage server.]

The 3-step workflow simulation
• Prototype used to match the use case
• It is composed of 3 phases, run sequentially in 2 different computing centers
– First phase: production
– Second phase: first processing of the files produced in phase 1, to reduce the data volume
– Third phase: final analysis of the data from phase 2

The 3-step simulation
[Diagram labels: iRODS database; Site 1: Production, Processing, Analysis; Site 2: Production, Processing; iRODS resources; distributed and shared storage]
• Cubic mesh composed of a set of cells along the X, Y and Z axes, evolving as a function of time T
• The modeling over the 2 sites is specified by a time range (time_1, time_2)
• Production step: file set E1 with related metadata on each site
• Processing step: subsets of E1 processed to produce the E2 set with related metadata on each site
• Analysis step: subsets of all E2 analysed to produce the E3 data set with related metadata

Metadata attached during the production phase
– WebDavis
– ARCS (Australian Research Collaboration Service)

Data Organization
Files management in the iRODS virtual data space
• Access rights (ACL) for iRODS users
• The user sees a virtual data space
• Virtual data collections (similar to directories) with virtual files spanning multiple physical resources
[Diagram labels: User; distributed/shared data; physical files at IDRIS; physical files at HLRS; collections Production, Processing, Analysis; user data collections Proj_data, Ref_data]

iRODS customization areas
• Disk quota management (quota check before moving data)
– applied to almost all storage spaces, including the shared repository
• Cleaning policy applied when the disk quota is reached
• Backup management
– applied to periodically back up the shared repository during the project life
– applied at the end of the projects
– development of file and collection backup rules, to back up data to a local or remote zone
• User metadata management (instead of flat files that gather data file information and location)

Conclusion on iRODS feasibility for DEISA
The 3-step simulation has shown the iRODS capability:
– to manage a distributed and shared context with various storage spaces and user profiles
– to store and retrieve data files
– to attach metadata and query for metadata
– to define a logical data organization
– to manage disk quotas and cleaning policy
– to set up automatic backup procedures
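
The quota check before moving data and the cleaning policy listed under the iRODS customization areas can be illustrated with a small stand-alone model. This is only a sketch in plain Python, not iRODS rule code and not the rules developed in DEISA: the `Repository` and `StoredFile` classes, the byte-based quota, and the "remove the oldest files first" cleaning policy are all assumptions made for the example, since the slides do not say which files the cleaning policy removes.

```python
from dataclasses import dataclass, field


@dataclass
class StoredFile:
    name: str
    size: int     # size in bytes
    mtime: float  # last-modification time, seconds since the epoch


@dataclass
class Repository:
    """Toy in-memory model of one storage space with a disk quota (hypothetical)."""
    quota: int                                 # bytes allowed in this space
    files: dict = field(default_factory=dict)  # file name -> StoredFile

    def used(self) -> int:
        return sum(f.size for f in self.files.values())

    def has_room_for(self, size: int) -> bool:
        """Quota check, performed before any data is moved."""
        return self.used() + size <= self.quota

    def clean(self, needed: int) -> list:
        """Assumed cleaning policy: remove oldest files until `needed` bytes fit."""
        removed = []
        for f in sorted(self.files.values(), key=lambda f: f.mtime):
            if self.has_room_for(needed):
                break
            del self.files[f.name]
            removed.append(f.name)
        return removed

    def put(self, new: StoredFile) -> list:
        """Check the quota first, clean if needed, then store the file.

        Returns the names of the files removed by the cleaning policy.
        """
        if new.size > self.quota:
            raise ValueError(f"{new.name} is larger than the whole quota")
        removed = [] if self.has_room_for(new.size) else self.clean(new.size)
        self.files[new.name] = new
        return removed


if __name__ == "__main__":
    repo = Repository(quota=100)
    repo.put(StoredFile("e1_t0.dat", 60, mtime=1.0))
    repo.put(StoredFile("e1_t1.dat", 30, mtime=2.0))
    # 90 bytes are used, so storing 50 more triggers the cleaning policy.
    print(repo.put(StoredFile("e2_t0.dat", 50, mtime=3.0)), sorted(repo.files))
```

In a real deployment the check would run on the iRODS server side as part of a policy rule, and the files to remove would be chosen according to the project's own retention criteria rather than by age alone.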