
Integrated Rule Oriented Data System
(iRODS)
Reagan W. Moore
Arcot Rajasekar
Mike Wan
{moore,sekar,mwan}@diceresearch.org
http://irods.diceresearch.org
Data Management Infrastructure
• Assemble distributed data into a shared collection
– Manage properties of the collection
– Enforce management policies
– Validate assessment criteria
– Automate administrative tasks
• Support wide range of management applications
– Data sharing, publication, preservation, analysis
– Works at scale (petabytes, hundreds of millions of files)
Data Management Challenges
• Data-driven research generates massive data collections
– Data sources are remote and distributed
– Collaborators are remote
– Wide variety of data types: observational data, experimental data,
simulation data, real-time data, office products, web pages, multi-media
• Collections contain millions of files
– Logical arrangement is needed for distributed data
– Discovery requires the addition of descriptive metadata
• Long-term retention requires migration of output into a
reference collection
– Automation of administrative functions is essential to minimize long-term labor support costs
– Creation of representation information for describing file context
– Validation of assessment criteria (authenticity, integrity)
Preservation Context
• Preservation metadata
– Authenticity (provenance) information
– Representation information (structure, semantics)
– Administrative information (replication, checksums, access
controls, retention, disposition)
• Preservation procedures
– Administration procedures
– ISO MOIMS-rac assessment procedures
– Preservation procedures generate preservation metadata
Overview of iRODS Architecture
• User - can search, access, add, and manage data & metadata
• iRODS Data System
– iRODS Data Server - disk, tape, etc.
– iRODS Rule Engine - tracks policies
– iRODS Metadata Catalog - tracks data
*Access data with Web-based Browser or iRODS GUI or Command Line clients.
iRODS Distributed Data Management
iRODS Resource Server
Types of File Manipulation
• Replication
• Load leveling across storage systems
• Registration
• Synchronization
• Checksums
• Aggregation
• Metadata
• Access controls (time dependent)
iRODS Micro-Services
• Function snippets that wrap a well-defined process
– Compute checksum
– Replicate file
– Integrity check
– Zoom image
– Get SDSS image cutout
– Search PubMed
• Written in C or Python (PHP, Java soon)
– Recovery micro-services to handle failure
– Web services can be wrapped as micro-services
• Can be chained to perform complex tasks
– Micro-services invoked by rule engine
iRODS Rules
• Server-side workflows
Action | condition | workflow chain | recovery chain
• Condition - test on any attribute:
– Collection, file name, storage system, file type, user group,
elapsed time, IRB approval flag, descriptive metadata
• Workflow chain:
– Micro-services / rules that are executed at the storage system
• Recovery chain:
– Micro-services / rules that are used to recover from errors
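A rule's four parts can be sketched in Python. This is an illustrative model only, not the iRODS rule engine; the `apply_rule` function and the rule dictionary layout are assumptions made for the sketch.

```python
# Hypothetical sketch of rule evaluation:
#   action | condition | workflow chain | recovery chain
def apply_rule(rule, context):
    """Run the workflow chain if the condition holds; on failure,
    run the recovery micro-services for the steps that completed."""
    if not rule["condition"](context):
        return False                      # rule does not fire for this action
    done = []
    try:
        for step in rule["workflow"]:     # micro-services executed in order
            step(context)
            done.append(step)
    except Exception:
        for step, undo in zip(done, rule["recovery"]):
            undo(context)                 # undo only what already ran
        raise
    return True

# Example: checksum then replicate on ingest into a specific collection
log = []
rule = {
    "condition": lambda ctx: ctx["collection"].startswith("/tempZone/home"),
    "workflow": [lambda ctx: log.append("checksum"),
                 lambda ctx: log.append("replicate")],
    "recovery": [lambda ctx: log.append("undo-checksum"),
                 lambda ctx: log.append("undo-replicate")],
}
apply_rule(rule, {"collection": "/tempZone/home/rods"})
```

The recovery chain mirrors the workflow chain one-for-one, which is what lets the engine roll back a partially completed workflow.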
iput With Replication
(Diagram: the client iputs a file; metadata is registered in the iCAT; the data is stored on Resource 1 and replicated to Resource 2 under /<filesystem>; each server consults its Rule Base, and the replication rule is added to the rule database.)
Policy-Virtualization:
Automate Operations
• System-centric Policies & Obligations:
– Manage retention, disposition, distribution, replication, integrity,
authenticity, chain of custody, access controls, representation
information, descriptive information requirement, logical
arrangement, audit trails, authorization, authentication
• Domain-specific Policies:
– Identification & Extraction of Metadata
– Ingestion Control for Provenance Attribution
– Processing of Data on Ingestion
• Creation of multi-resolution images, type-identification, anonymization,…
– Processing of Data on Access
• IRB Approval for data access, Data sub-setting, Merging of multiple images,
conversion, redaction, …
Policy/rule execution
• Immediate - enforced at time of action invocation
• Deferred - applied at a future time
• Periodic - applied at defined interval
• Interactive - applied on demand
• iSEC scheduler / batch system supports
– Local workflows
– Distributed workflows
– Deferred and periodic workflows
– (Launch micro-services on clusters, clouds, supercomputers)
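The immediate/deferred/periodic distinction can be modeled with a small priority queue. This is a sketch with made-up names (`RuleScheduler`, `submit`, `run_until`), not the iSEC scheduler's API, and it uses a simulated clock rather than real time.

```python
import heapq

class RuleScheduler:
    """Minimal queue supporting immediate, deferred, and periodic tasks."""
    def __init__(self):
        self.queue = []   # entries: (run_at, seq, interval, task)
        self.seq = 0      # tie-breaker so tasks never compare directly

    def submit(self, task, when=0, interval=None):
        """when=0 -> immediate; when>0 -> deferred; interval -> periodic."""
        heapq.heappush(self.queue, (when, self.seq, interval, task))
        self.seq += 1

    def run_until(self, now):
        """Run every task due at or before `now` (simulated clock)."""
        while self.queue and self.queue[0][0] <= now:
            run_at, _, interval, task = heapq.heappop(self.queue)
            task(run_at)
            if interval:                        # periodic tasks re-queue
                self.submit(task, run_at + interval, interval)

ran = []
s = RuleScheduler()
s.submit(lambda t: ran.append(("immediate", t)))
s.submit(lambda t: ran.append(("deferred", t)), when=5)
s.submit(lambda t: ran.append(("periodic", t)), when=2, interval=2)
s.run_until(6)
```

Running to simulated time 6 fires the immediate task at 0, the periodic task at 2, 4, and 6, and the deferred task at 5.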
Checksum Validation Rule
myChecksumRule{
msiMakeQuery("DATA_NAME, COLL_NAME, DATA_CHECKSUM",*Condition,*Query);
msiExecStrCondQuery(*Query,*B);
assign(*A,0);
forEachExec (*B) {
msiGetValByKey(*B,COLL_NAME,*C);
msiGetValByKey(*B,DATA_NAME,*D);
msiGetValByKey(*B,DATA_CHECKSUM,*E);
msiDataObjChksum(*B,*Operation,*F);
ifExec (*E != *F) {
writeLine(stdout,file *C/*D has registered checksum *E and computed checksum *F);
}
else {
assign(*A,*A + 1);
}
}
ifExec(*A > 0) {
writeLine(stdout, have *A good files);
}
}
*Condition can be COLL_NAME like ‘/ils161/home/moore/genealogy/%’
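For readers unfamiliar with the rule language, the same logic can be expressed in plain Python. This is an illustrative analogue of the rule above, not an iRODS micro-service; `verify_checksums` and the registry dictionary are assumptions (iRODS historically registers MD5 checksums).

```python
import hashlib

def verify_checksums(registered):
    """registered: {path: expected_md5_hex}. For each file, compare the
    registered checksum with a freshly computed one; return the count of
    good files and the list of paths that failed verification."""
    good, bad = 0, []
    for path, expected in registered.items():
        with open(path, "rb") as f:
            computed = hashlib.md5(f.read()).hexdigest()
        if computed == expected:
            good += 1
        else:
            print(f"file {path} has registered checksum {expected} "
                  f"and computed checksum {computed}")
            bad.append(path)
    return good, bad
```

In the rule, the registry comes from the iCAT query (`DATA_CHECKSUM`) and the fresh value from `msiDataObjChksum`; here both sides are ordinary file reads.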
Quota Checking Rule
mytestRule||
assign(*A,0)##
assign(*ContInx,1)##
assign(*G,0)##
msiMakeGenQuery("DATA_SIZE",*Condition,*Query)##
msiExecGenQuery(*Query,*B)##
forEachExec(*B,msiGetValByKey(*B,DATA_SIZE,*C)##
assign(*A,*A + *C)##
assign(*G, *G + 1),nop)##
whileExec(*ContInx > 0, msiGetMoreRows(*Query, *B, *ContInx)##
forEachExec(*B,msiGetValByKey(*B,DATA_SIZE,*C)##
assign(*A,*A + *C)##
assign(*G, *G + 1),nop),nop)##
writeLine(stdout,Total size of data owned by *D on resource *E is *A)##
writeLine(stdout,Number of files is *G)|nop
*D= rods%*E= renci-vault1%
*Condition= DATA_OWNER_NAME = 'rods' AND RESC_NAME = 'renci-vault1'
ruleExecOut
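The quota rule's core is a paged accumulation: each `msiGetMoreRows` call returns another batch of `DATA_SIZE` values until the continuation index drops to zero. A minimal Python sketch of that logic (the helper name is hypothetical):

```python
def sum_owned_data(pages):
    """pages: an iterable of result batches, each a list of file sizes,
    standing in for successive msiExecGenQuery / msiGetMoreRows batches.
    Returns (total_bytes, file_count)."""
    total, count = 0, 0
    for batch in pages:
        for size in batch:
            total += size
            count += 1
    return total, count

# e.g. three server-side pages of DATA_SIZE values
pages = [[100, 250], [4096], [1, 1, 1]]
total, count = sum_owned_data(pages)
print(f"Total size of data owned is {total}; number of files is {count}")
```

The rule above does the same accumulation in `*A` (total size) and `*G` (file count), with the `whileExec` loop playing the role of the outer iteration over pages.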
Managing Structured Information
• Information exchange between micro-services
– Parameter passing
– White board memory structures
– High performance message passing (iXMS)
– Persistent metadata catalog (iCAT)
• Structured Information Resource Drivers
– Interact with remote structured information
resource (HDF5, netCDF, tar file)
Structured Data
• Aggregate data into a tar file
– Mount a tar file to enable manipulation of files
within the tar file
• Use HDF5 to manage aggregations of files
– Micro-services that apply HDF5 library calls at the
remote storage location
• Mount a remote directory
– Synchronize files in directory with files in iRODS
collection
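Plain Python's `tarfile` module shows the aggregation idea in miniature. This is not the iRODS mounted-collection driver, just a sketch of bundling small files into one archive and reading a single member back without unpacking the rest.

```python
import io
import tarfile

files = {"a.txt": b"alpha", "b.txt": b"beta"}

# Aggregate: write members into an in-memory tar archive
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# "Mount": reopen the archive and manipulate one file inside it
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.extractfile("b.txt").read()
```

In iRODS the mounted tar collection exposes the members as ordinary data objects, so clients never see the archive boundary; here the same effect comes from `extractfile` on a named member.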
Micro-services vs Web Services
• Micro-services
– Manage exchange of structured information
between micro-services through memory
– Serialize information for transmission over a
network
– Optimized protocol for data transmission
• Single message for small files (< 32 MB)
• Parallel I/O for large files
• Web Services
– SOAP /HTTP data transmission between services
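The size-based transfer choice can be sketched as follows. The 32 MB threshold is from the slide; the function names are made up for illustration and are not the iRODS API.

```python
SINGLE_MESSAGE_LIMIT = 32 * 1024 * 1024   # 32 MB, per the slide

def plan_transfer(size_bytes, max_streams=4):
    """Return (mode, byte_ranges): one range for a single-message
    transfer, or one range per stream for a parallel transfer."""
    if size_bytes < SINGLE_MESSAGE_LIMIT:
        return "single-message", [(0, size_bytes)]
    chunk = -(-size_bytes // max_streams)          # ceiling division
    ranges = [(start, min(start + chunk, size_bytes))
              for start in range(0, size_bytes, chunk)]
    return "parallel", ranges
```

Small files avoid the cost of setting up parallel streams; large files split into contiguous byte ranges, one per stream, that together cover the whole file.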
Research Collaborations
• NSF NARA - supports application of data grids to
preservation environments
• NSF SDCI - supports development of core iRODS data
grid infrastructure
• NSF OOI - future integration of data grids with real-time sensor data streams and grid computing
• NSF TDLC - production TDLC data grid and extension
to remaining 5 Science of Learning Centers (0.3 FTE)
• NSF SCEC - current production environment (0.1 FTE)
• NSF Teragrid - production environment (0.1 FTE)
NSF Software Development for
Cyberinfrastructure
• Conduct research on policy management in
distributed data systems
– Collection oriented data management
– Adaptive middleware architecture
– Distributed rule engine
– Server-side (remote) workflow execution
– Transactional recovery semantics
– Automated validation
– Automation of large-scale data administrative functions
– Enforcement of management policies
Overview of iRODS Architecture:
iRODS Shows Unified "Virtual Collection"
• User with client views & manages data
• Archivist sees single "Virtual Collection"
• Processing Cache, Archive, and Access Cache - each on disk, tape, database, file system, etc.
The iRODS Data Grid installs in a “layer” over existing or new data, letting you view, manage,
and share part or all of diverse data in a unified Collection.
Generic Data Management Systems
iRODS - integrated Rule-Oriented Data System

Data Management Environment:    Assessment Criteria | Management Policies | Management Procedures
Management Functions:           Conserved Properties | Control Mechanisms | Remote Operations
Data Management Infrastructure: State Information | Rules | Micro-services   (Data grid - management virtualization)
Physical Infrastructure:        Database | Rule Engine | Storage System      (Data grid - data and trust virtualization)
NARA Preservation Application
• Transcontinental Persistent Archive Prototype
– Use data grid technology to build a preservation
environment
– Conduct research on preservation concepts
•
•
•
•
Infrastructure independence
Enforcement of preservation properties
Automation of administrative preservation processes
Validation of preservation assessment criteria
– Demonstrate preservation on selected NARA digital
holdings
• Integration of generic infrastructure with preservation
technologies (Cheshire, MVD, JHOVE, PRONOM, Fedora, DSpace)
Preservation is an Integral Part
of the Data Life Cycle
• Organize project data into a shared collection
• Publish data in a digital library for use by other
researchers
• Enable data-discovery & data-driven analyses
• Preserve reference collections for use by future
research initiatives
• Analyze new collection against prior state-of-the-art
data
• Define & Enforce Policies for long-term management
and curation
National Archives and Records Administration
Transcontinental Persistent Archive Prototype
Federation of Seven Independent Data Grids
• NARA I (MCAT)
• NARA II (MCAT)
• Georgia Tech (MCAT)
• Rocket Center (MCAT)
• UNC (MCAT)
• U Md (MCAT)
• UCSD (MCAT)
Extensible Environment, can federate with additional research and
education sites. Each data grid uses different vendor products.
To Manage Long-term Preservation
• Define desired preservation properties
– Authenticity / Integrity / Chain of Custody / Original
arrangement
– Life Cycle Data Requirements Guide
• Implement preservation processes
– Appraisal / accession / arrangement / description /
preservation / access
• Manage preservation environment
– Minimize costs
– Validate assessment criteria to verify preservation
properties
ISO MOIMS
repository assessment criteria
• Are developing 150 rules that implement the ISO assessment criteria
– 90: Verify descriptive metadata and source against SIP template and set SIP compliance flag
– 91: Verify descriptive metadata against semantic term list
– 92: Verify status of metadata catalog backup (create a snapshot of metadata catalog)
– 93: Verify consistency of preservation metadata after hardware change or error
Sustainability
• Economic sustainability
– Reference collections
– Repurpose reference collections to support use by multiple
communities
– Federate resources across multiple communities
• Technological sustainability
– Open source software
– Support continued porting through international collaborations
• Policy sustainability
– Evolve management policies to support new user communities
• Access sustainability
– Support data manipulation and display by new communities
Data Virtualization
• Data grid layers: Access Interface → Standard Micro-services → Standard Operations → Storage Protocol → Storage System
• Map from the actions requested by the access method to a standard set of micro-services. The standard micro-services are mapped to the operations supported by the storage system.
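The two mapping layers can be made concrete with a pair of lookup tables. The driver and operation names here are hypothetical stand-ins (though `msiDataObjGet`/`msiDataObjPut` are real iRODS micro-service names); the point is only the shape of the indirection.

```python
# Layer 1: client actions map to a standard set of micro-services
ACTION_TO_MICROSERVICE = {"get": "msiDataObjGet", "put": "msiDataObjPut"}
ACTION_TO_OPERATION = {"get": "read", "put": "write"}

# Layer 2: every storage driver implements the same standard operations,
# translated to its own protocol (names below are illustrative)
STORAGE_DRIVERS = {
    "unix-fs": {"read": "posix_read", "write": "posix_write"},
    "tape":    {"read": "hpss_read",  "write": "hpss_write"},
}

def resolve(action, storage):
    """Map a client action down to (micro-service, storage-protocol call)."""
    micro = ACTION_TO_MICROSERVICE[action]
    op = ACTION_TO_OPERATION[action]
    return micro, STORAGE_DRIVERS[storage][op]
```

Because clients only ever name actions and drivers only ever implement standard operations, either side can be swapped without touching the other, which is the infrastructure independence the slides emphasize.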
Migration of Parsing Routines
• Data Grids minimize the effort needed to
sustain parsing routines
– Parsing routine is encapsulated as a micro-service
– New clients can then be ported on top of the data
grid without changing the parsing routine
• Map from actions to standard actions
– New storage systems can be added to the data
grid without changing the parsing routine
• Map from standard operations to storage protocol
Clients
• Unix shell commands
• Java I/O library
• C I/O redirection library
• Windows browser
• Web-DAV
• Kepler workflow
• HDF5 client
• DSpace
• Fedora
• Python library
Scale of iRODS Data Grid
• Number of files
– Tens of millions to hundreds of millions of files
• Size of data
– Hundreds of terabytes to petabytes of data
• Number of policy enforcement points
– 20 actions define when policy is checked
• Amount of metadata
– 112 metadata attributes for system information per file
• Number of policies
– 150 policies
• Number of data grids
– Federation of tens of data grids
Federation Across Spatial Scales
• International collaborations
– Australian Research Collaboration Service (ARCS)
– Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN)
– Cinegrid
• National collaborations
– Temporal Dynamics of Learning Center (TDLC)
– Ocean Observatories Initiative (OOI)
• Regional collaborations
– LSU data grid
– HASTAC humanities data grid
– Distributed Custodial Archive Preservation Environment (DCAPE)
• State collaborations
– RENCI data grid
– North Carolina State Library
• Institutional repositories
– Carolina Digital Repository
– SIO Repository
Integrating across Supercomputer / Cloud / Grid
(Diagram: iRODS Server Software at each site forms the iRODS Data Grid, layered over a Supercomputer File System with a Parallel Application, a Cloud Disk Cache with a Virtual Machine Environment, and a Teragrid Node with Grid Services; example deployments: RENCI, OOI, SCEC.)
ARCS Data Fabric
Davis – Modes
Temporal Dynamics of Learning Center
• Scientist A adds data to the shared collection
• Scientist B accesses and analyzes shared data
• iRODS Data System: Brain Data Server (CA), Audio Data Server (NJ), Video Data Server (TN), iRODS Metadata Catalog
Scientists can use iRODS as a “data grid” to share multiple types of data, near
and far. iRODS Rules also enforce and audit human subjects access restrictions.
iRODS Evaluations
• NASA Jet Propulsion Laboratory
– iRODS selected for managing distribution of Planetary Data
System records
• NASA National Center for Computational Sciences
– iRODS chosen to manage archive of simulation output and
serve as access data cache for distribution
• AVETEC appraisal for DoD HPC centers
– iRODS now provides all required capabilities
• French National Library
– iRODS rules control ingestion, access, and audit functions
• Australian Research Collaboration Service
– iRODS manages data distributed between academic
institutions
Development Team
• DICE team
– Arcot Rajasekar - iRODS development lead
– Mike Wan - iRODS chief architect
– Wayne Schroeder - iRODS developer
– Bing Zhu - Fedora, Windows
– Lucas Gilbert - Java (Jargon), DSpace
– Paul Tooby - documentation, foundation
– Sheau-Yen Chen - data grid administration
• Preservation
– Richard Marciano - Preservation development lead
– Chien-Yi Hou - preservation micro-services
– Antoine de Torcy - preservation micro-services
Foundation
• Data Intensive Cyber-environments
– Non-profit open source software development
– Promote use of iRODS technology
– Support standards efforts
– Coordinate international development efforts
• IN2P3 - quota and monitoring system
• King’s College London - Shibboleth
• Australian Research Collaboration Services - WebDAV
• Academia Sinica - SRM interface
iRODS is a "coordinated NSF/OCI-Nat'l Archives research activity" under the auspices of the President's NITRD Program, and is identified among the priorities underlying the President's 2009 Budget Supplement in the area of Human and Computer Interaction Information Management technology research.
Reagan W. Moore
[email protected]
http://irods.diceresearch.org
NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”
NSF SDCI-0721400 “Data Grids for Community Driven Applications”