iRODS workshop, September 26-28, 2012, Linköping (Sweden)
iRODS experience in DEISA
Agnès Ansari – [email protected]
The iRODS usability for the DEISA
infrastructure activity
• Investigate the current data management and access
needs within DEISA
• Find use cases that are representative of the data
management trends
• Investigate iRODS customization capabilities
• Customize iRODS as needed for the use cases
• Evaluate how well iRODS matches these requirements
The data management and access
questionnaire
• The aim of this questionnaire was to evaluate the applicability of
iRODS to DEISA (Nov 2009)
• It documents users' data management, access and organization
practices and improves our knowledge of them
• We received completed forms from end users from the IDRIS, HLRS, FZJ
and LRZ WP7 teams
• It can be useful for DECI projects, virtual organizations or any
HPC project
Topic: description
• Research domain and organization
– Primary area of research and project organization: describe the scientific domain and the team/organization size
– Computation organization: specify where the data is located and where the computations run (single, main, distributed sites)
• Data processing types and scenarios
– Data access control and security: specify whether the data is sensitive and whether encryption is needed
– Kind of computation: specify whether simulations or data analysis/processing are computationally or I/O intensive
– Overall data volume for the project: specify the range, from less than 10 GB up to more than 100 TB
– I/O volume for a typical processing job: specify the range, from 100 MB up to 1 TB
– Typical I/O schemes: specify write-type, read-type, read-write-type or read-update-type jobs
• Data management
– Data access type: describe the use of a database, a POSIX interface, an I/O software layer (HDF5), parallel I/O
– File access, fraction of the file read in, multiple-read scenarios: describe multiple reads of input files, the fraction of the files read in, random access or random reads of multiple input files
– File creations and data writes: describe whether temporary or permanent storage is used during job execution
– Data storage: specify whether the data is stored at one site, spread over multiple sites (from/to), or on a workstation
– File management: describe the organization of files in directories/sub-directories or the use of metadata
– Data transfer tools: describe the tools used to transfer data
Data management questionnaire feedback (1)
• Completed by DECI projects from different research domains:
engineering, particle physics, ...
• Project organization: rather small groups (<5 persons) and some
collaborations (<20 persons)
• Computation organization: data and computation are distributed
over several sites, so data has to be exchanged between sites
• In most cases, data is considered non-sensitive, so usually all
group members can access the data and data encryption is not
needed
• Protocols used: scp, rsync, gridftp
Data management questionnaire feedback (2)
• Computationally intensive simulations are performed rather than I/O intensive
simulations or data analysis/processing: the I/O rate is below 1 MB of disk I/O
per second of CPU time (low I/O)
• Data volume per project: 1 to 100TB
• I/O volume per job: 50 to 500GB
• I/O scheme: read input files/data – write output files/data (read-write type jobs)
rather than read-update type jobs
• Files are organized in directories/sub-directories. Files are identified
by their names (using specific naming conventions)
• Data access: direct access to data or files, or parallel I/O
• File access: complete file reads or multiple reads. Random access or
random reads are seldom used
• Data storage: at one computing site or spread over several (GPFS is not the only
data access method)
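The low-I/O criterion quoted above (less than 1 MB of disk I/O per second of CPU time) can be checked with a few lines of Python. This is only a sketch: the function names, the decimal definition of a megabyte and the example job figures are assumptions made for illustration, not values from the questionnaire.

```python
MB = 10**6  # assumption: decimal megabytes


def io_rate_mb_per_cpu_second(io_bytes, cpu_seconds):
    """Return the disk I/O volume in MB per second of CPU time."""
    if cpu_seconds <= 0:
        raise ValueError("cpu_seconds must be positive")
    return io_bytes / MB / cpu_seconds


def is_low_io(io_bytes, cpu_seconds, threshold_mb=1.0):
    """True when the job stays below the low-I/O threshold."""
    return io_rate_mb_per_cpu_second(io_bytes, cpu_seconds) < threshold_mb


# Hypothetical job: 100 GB of I/O, 512 cores busy for 24 hours.
job_io_bytes = 100 * 10**9
job_cpu_seconds = 512 * 24 * 3600
job_is_low_io = is_low_io(job_io_bytes, job_cpu_seconds)
```

With these made-up figures the job falls well under the threshold, which matches the "computationally intensive rather than I/O intensive" profile reported by the projects.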
Use case for using iRODS in DEISA
• A collaboration of 2 DECI projects running jobs across 2 DEISA sites
– Management of distributed and shared data
– Heterogeneous computing environment
– Heterogeneous set of data storage spaces
– Various user profiles (DEISA and non-DEISA accounts) that
have to collaborate by sharing data
[Figure: storage layout at DEISA Site_a and DEISA Site_b. Legend: distributed storage, permanent storage, archive storage. Labels: permanent storage and archive storage (HOME/WORKDIR) for local users and DEISA users; distributed storage on a specific node (DEISA_HOME/DEISA_DATA) for DEISA users; GPFS as distributed and shared storage for DEISA users.]
[Figure: data flow between DEISA site_a and DEISA Site_b over the DEISA network. Labels: $DEISA_DATA/CR (distributed/shared storage on GPFS, available on login nodes only); permanent storage $WORKDIR and $WORK (job computing environment); archive storage and storage server; localuser_site_a and localuser_site_b; input data/results; data copy (I/O).]
The 3-step workflow simulation
• Prototype used to match the use case
• It is composed of 3 phases, run sequentially in 2 different
computing centers
– First phase: production
– Second phase: first processing of the files produced in phase 1 to
reduce data volume
– Third phase: final analysis of data from phase 2
The 3-step simulation
[Figure labels: iRODS database; iRODS resources (distributed and shared storage); Site 1: production, processing, analysis; Site 2: production, processing.]
• Cubic mesh composed of cells along the X, Y and Z axes,
evolving as a function of time T
• The modeling over the 2 sites is specified by a time
range (time_1, time_2)
• Production step: file set E1 with related metadata on each site
• Processing step: subsets of E1 processed to produce E2 set with related
metadata on each site
• Analysis step: subsets of all E2 analysed to produce E3 data set with
related metadata
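The three steps and their metadata can be sketched in plain Python. This is an illustrative model only, not the prototype's code and not an iRODS API: the `DataFile` class, the file names, the metadata keys and the time ranges are all hypothetical, and the in-memory `query` helper merely stands in for a metadata query against the iRODS catalog.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    name: str
    site: str
    metadata: dict = field(default_factory=dict)


def production(site, time_range, n_files):
    """Phase 1: produce file set E1 on one site and tag each file."""
    t_start, t_end = time_range
    return [
        DataFile(f"E1_{site}_{i}.dat", site,
                 {"phase": "production", "time_start": t_start, "time_end": t_end})
        for i in range(n_files)
    ]


def processing(e1_files):
    """Phase 2: derive one reduced E2 file per E1 file, on the same site."""
    return [
        DataFile(f.name.replace("E1", "E2", 1), f.site,
                 {**f.metadata, "phase": "processing", "source": f.name})
        for f in e1_files
    ]


def analysis(e2_files, site):
    """Phase 3: combine the E2 files of all sites into one E3 result."""
    return DataFile("E3_result.dat", site,
                    {"phase": "analysis", "sources": [f.name for f in e2_files]})


def query(files, **criteria):
    """Select files whose metadata matches every key/value pair given."""
    return [f for f in files
            if all(f.metadata.get(k) == v for k, v in criteria.items())]


# Hypothetical time ranges: each site models one part of the simulation.
e1 = production("site_1", (0, 50), 2) + production("site_2", (50, 100), 2)
e2 = processing(e1)
e3 = analysis(e2, "site_1")
```

Selecting files by metadata, for example `query(e1, time_start=50)`, is what replaces the file-name conventions that the projects relied on before.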
Metadata attached during the production phase –
WebDavis – ARCS (Australian Research Collaboration Service)
Data Organization
• File management in the iRODS virtual data space
• Access rights (ACL) for iRODS users
[Figure: the user sees a virtual data space of distributed/shared data. Virtual data collections (similar to directories) hold virtual files that span multiple physical resources: physical files at IDRIS and physical files at HLRS. Collections shown: Production, Processing, Analysis. User data collections: Proj_data, Ref_data.]
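The idea of a virtual data space can be illustrated with a small Python model in which logical paths map to physical files at different sites. Everything here is hypothetical (zone name, paths, users, and the simplified rule that a collection's access rights apply to everything below it); it sketches the concept and does not use any iRODS API.

```python
from posixpath import dirname

# Logical path -> (site holding the physical file, physical path).
CATALOG = {
    "/deisa/Proj_data/Production/e1_a.dat": ("IDRIS", "/vault/idris/e1_a.dat"),
    "/deisa/Proj_data/Production/e1_b.dat": ("HLRS", "/vault/hlrs/e1_b.dat"),
    "/deisa/Proj_data/Processing/e2_a.dat": ("IDRIS", "/vault/idris/e2_a.dat"),
    "/deisa/Ref_data/mesh.dat": ("HLRS", "/vault/hlrs/mesh.dat"),
}

# Access rights per collection (assumed to apply to everything below it).
ACL = {
    "/deisa/Proj_data": {"alice": "write", "bob": "read"},
    "/deisa/Ref_data": {"alice": "read"},
}


def list_collection(collection):
    """Return the logical files directly inside a virtual collection."""
    return sorted(p for p in CATALOG if dirname(p) == collection)


def sites_for(collection):
    """Return the sites whose physical files back a collection."""
    prefix = collection.rstrip("/") + "/"
    return {site for path, (site, _) in CATALOG.items() if path.startswith(prefix)}


def can_read(user, logical_path):
    """Walk up from the path to the nearest collection that has an ACL."""
    path = logical_path
    while path not in ("/", ""):
        if path in ACL:
            return user in ACL[path]  # both "read" and "write" allow reading
        path = dirname(path)
    return False
```

In this model a single collection such as `/deisa/Proj_data` is backed by files at both IDRIS and HLRS, while the user only ever handles logical paths.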
iRODS customization areas
• Disk quota management (quota check before moving data)
– applied to almost all storage spaces including the
shared repository
• Cleaning policy applied when the disk quota is reached
• Backup management
– applied to periodically back up the shared repository
during the project life
– applied at the end of the projects
– development of file and collection backup rules, to
back up data to a local or remote zone
• User metadata management (instead of flat files that gather
data file information and locations)
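The quota check and the cleaning policy can be sketched as follows. This is a plain Python illustration of the logic, not an iRODS rule: the function names are hypothetical, and the choice to delete the oldest files first is an assumption, since the slides do not say which files the cleaning policy removes.

```python
def can_store(used_bytes, quota_bytes, incoming_bytes):
    """Quota check before moving data: does the incoming data still fit?"""
    return used_bytes + incoming_bytes <= quota_bytes


def clean_oldest(files, quota_bytes, incoming_bytes):
    """Free space by dropping the oldest files until the incoming data fits.

    files is a list of (name, size_bytes, mtime) tuples.
    Returns (kept_files, removed_names). Oldest-first is an assumed policy.
    """
    kept = sorted(files, key=lambda f: f[2])  # oldest first
    removed = []
    used = sum(size for _, size, _ in kept)
    while kept and not can_store(used, quota_bytes, incoming_bytes):
        name, size, _ = kept.pop(0)
        removed.append(name)
        used -= size
    return kept, removed


# Hypothetical repository content: (name, size in bytes, modification time).
repository = [("a.dat", 40, 1), ("b.dat", 30, 3), ("c.dat", 20, 2)]
kept, removed = clean_oldest(repository, quota_bytes=100, incoming_bytes=30)
```

Running the check before the transfer, rather than after it fails, is what "quota check before moving data" amounts to in this sketch.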
Conclusion on iRODS feasibility for DEISA
The 3-step simulation has shown that iRODS is able:
– to manage a distributed and shared context with various
storage spaces and user profiles
– to store and retrieve data files
– to attach metadata and query on metadata
– to define a logical data organization
– to manage disk quotas and a cleaning policy
– to set up automatic backup procedures