Storage Resource Broker
Building Preservation Environments

Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.sdsc.edu/srb/

Topics
• Preservation environments
• Digital library technology
• Data grid technology
• Fundamental concepts / future research
    • Data / information / knowledge
    • Persistent objects
    • Knowledge management

Preservation
• Archival processes through which a digital entity is extracted from its creation environment, and then supported in a preservation environment, while maintaining authenticity and integrity information.
• Extraction process requires insertion of support infrastructure underneath the digital material
• Goal is infrastructure independence, the ability to use any commercial storage system, database, or access mechanism

Preservation Communities
• InterPARES - diplomatics
    • Preservation of records
• NARA
    • Preservation of records from federal agencies
• State archives
    • Preservation of submitted “collections”
• Continuum model
    • Preservation of active data and records

Preservation
• What differentiates a preservation environment from a digital library?

Digital Libraries
• Support the community vocabulary
    • Discovery and browse using community relevant terms
• Support the community data format
    • Maintain information on the data format of each item
• Support the community access services
    • Provide services that manipulate and display the community data format

Preservation Mandates
• Diplomatics
    • Authenticity
    • Integrity
• NARA
    • Infrastructure independence
    • Scalability
• State archives
    • Automation of archival processes

InterPARES - Diplomatics
• Authenticity - maintain links to metadata for:
    • Date record is made
    • Date record is transmitted
    • Date record is received
    • Date record is set aside [i.e. filed]
    • Name of author (person or organization issuing the record)
    • Name of addressee (person or organization for whom the record is intended)
    • Name of writer (entity responsible for the articulation of the record’s content)
    • Name of originator (electronic address from which record is sent)
    • Name of recipient(s) (person or organization to whom the record is sent)
    • Name of creator (entity in whose archival fonds the record exists)
    • Name of action or matter (the activity for which the record is created)
    • Name of documentary form (e.g. E-mail, report, memo)
    • Identification of digital components
    • Identification of attachments (e.g. digital signature)
    • Archival bond (e.g. classification code)

InterPARES - Diplomatics
• Integrity - maintain links to metadata for:
    • Name(s) of the handling office / officer
    • Name of office of primary responsibility for keeping the record
    • Annotations or comments
    • Actions carried out on the record
    • Technical modifications due to transformative migration
    • Validation

Preservation Approach
• Provide mechanisms to:
    • Create archival context for the content
        • Context is preservation metadata (provenance, administrative, descriptive, structural, behavioral)
        • Content is the submitted digital entity
    • Assert integrity - the consistency between the context and the content
        • Track operations done on material and update context
    • Assert authenticity - that the material represents the original site
        • Track the chain of custody
    • Manage technology evolution (encoding standard, storage repository, information repository, access methods)

Data Grids
• What is the difference between a preservation environment and a data grid?
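
The Preservation Approach slide describes integrity as consistency between context and content, maintained by tracking every operation. A minimal sketch of that idea follows, assuming a SHA-256 checksum as the integrity value and an in-memory event list as the chain of custody. The names `ArchivalContext`, `ingest`, `verify`, and `migrate` are hypothetical and are not part of the SRB.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ArchivalContext:
    """Hypothetical preservation metadata kept alongside the content."""
    sha256: str                                   # integrity value for the current content
    custody: list = field(default_factory=list)   # (timestamp, actor, action) events

    def log(self, actor: str, action: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.custody.append((stamp, actor, action))


def ingest(content: bytes, actor: str) -> ArchivalContext:
    """Create the context at accession time and record the first custody event."""
    ctx = ArchivalContext(sha256=hashlib.sha256(content).hexdigest())
    ctx.log(actor, "ingest")
    return ctx


def verify(content: bytes, ctx: ArchivalContext, actor: str) -> bool:
    """Assert integrity: the content must still agree with its context."""
    ok = hashlib.sha256(content).hexdigest() == ctx.sha256
    ctx.log(actor, "verify:ok" if ok else "verify:MISMATCH")
    return ok


def migrate(content: bytes, ctx: ArchivalContext, actor: str, transform) -> bytes:
    """Transformative migration: the content changes, so the context is updated in the same step."""
    new_content = transform(content)
    ctx.sha256 = hashlib.sha256(new_content).hexdigest()
    ctx.log(actor, "migrate")
    return new_content
```

The design point is that content is never changed without the context being updated and the operation being logged in the same call; a checksum mismatch is itself recorded as a custody event rather than silently ignored.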
Data Grids
• Manage shared collections that are distributed in space
    • Location of item, access controls, checksums
• Implement infrastructure independence
    • Standard operations for interacting with storage repositories
• Implement presentation independence
    • Standard APIs to support porting of user interfaces

Preservation Environment
• Digital library infrastructure that supports
    • Preservation metadata
    • Arrangement and description of items
    • Access mechanisms
• Data grid infrastructure that supports
    • Shared collections that are migrated forward in time
    • Management of technology evolution
    • Administrative metadata providing status of records

Infrastructure Independence
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Naming conventions provided by storage systems

Data Grids Provide a Level of Indirection for Each Naming Convention
Data Access Methods (C library, Unix, Web Browser)
Data Collection

    Storage Repository                   Data Grid
    • Storage location                   • Logical resource name space
    • User name                          • Logical user name space
    • File name                          • Logical file name space
    • File context (creation date,…)     • Logical context (metadata)
    • Access constraints                 • Control/consistency constraints

Data is organized as a shared collection

Demonstration
• Logical file name space
• Distinguished user name space
• Shared collection
• Distributed data storage
• Replication as a file property
• Digital entities

Data Grids
• Provide two levels of indirection:
    • Low level API used to interact with storage repositories
        • Standard operations for manipulating files in a storage system
        • Standard operations for manipulating a catalog stored in a database
    • High level API used to support user interfaces
        • Three basic APIs - “C” library call, Unix shell commands, Java class library
        • Other interfaces are ported on top of the basic APIs.
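
The level of indirection described above can be illustrated with a toy catalog that maps a logical file name to replicas held under each storage system's own naming convention. This is only a sketch under stated assumptions: `Catalog`, `Replica`, the driver functions, and every resource name and path are hypothetical, and the real MCAT schema and SRB calls are not shown in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str        # logical resource name (hypothetical)
    physical_path: str   # name the storage system itself uses


@dataclass
class Catalog:
    """Toy stand-in for a metadata catalog; not the MCAT schema."""
    resources: dict = field(default_factory=dict)  # logical resource -> driver callable
    files: dict = field(default_factory=dict)      # logical file name -> list of Replica

    def register(self, logical_name: str, resource: str, physical_path: str) -> None:
        if resource not in self.resources:
            raise KeyError(f"unknown resource: {resource}")
        self.files.setdefault(logical_name, []).append(Replica(resource, physical_path))

    def read(self, logical_name: str) -> bytes:
        """Resolve the logical name and return data from the first replica that answers."""
        for replica in self.files.get(logical_name, []):
            driver = self.resources[replica.resource]
            try:
                return driver(replica.physical_path)
            except OSError:
                continue  # this replica is unavailable; try the next one
        raise FileNotFoundError(logical_name)


# Two hypothetical storage drivers: one simulates an offline archive,
# the other serves bytes from an in-memory dictionary.
def offline_driver(path: str) -> bytes:
    raise OSError("archive offline")


_store = {"/hpss/u123/f0001.dat": b"hello"}


def memory_driver(path: str) -> bytes:
    if path not in _store:
        raise FileNotFoundError(path)
    return _store[path]


catalog = Catalog(resources={"archive-a": offline_driver, "archive-b": memory_driver})
catalog.register("/collection/report.txt", "archive-a", "/tape/0001/report")
catalog.register("/collection/report.txt", "archive-b", "/hpss/u123/f0001.dat")
data = catalog.read("/collection/report.txt")
```

The caller only ever uses the logical name `/collection/report.txt`; which archive answers, and what the file is called there, stays inside the catalog. Replication appears as a property of the logical file, as in the Demonstration slide.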
Storage Resource Broker 3.3 (architecture diagram, layers from top to bottom)
• Application
• Access interfaces
    • C Library, Java
    • Unix Shell
    • Linux I/O
    • NT Browser, Kepler Actors
    • C++ DLL / Python, Perl, Windows
    • HTTP, OAI, DSpace, WSDL, OpenDAP, (WSRF)
    • GridFTP
• Federation Management
• Consistency & Metadata Management / Authorization, Authentication, Audit
• Logical Name Space, Latency Management, Data Transport, Metadata Transport
• Database Abstraction
    • Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
• Storage Repository Abstraction
    • Archives - Tape: Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
    • File Systems: Unix, NT, Mac OSX
    • ORB
    • Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix

Accessing Multiple Types of Storage Systems
(diagram: User Application; Archive at SDSC, Archive at NARA, Archive at U Md)

Standard Data Access Operations
• Remote operations
    • Unix file system
    • Latency management
    • Procedures
    • Transformations
    • Third party transfer
    • Filtering
    • Queries
• Collective operations
    • Replication
    • Fault tolerance
    • Load leveling
Common set of operations for interacting with every type of storage repository
(diagram: User Application; Archive at SDSC, Archive at NARA, Archive at U Md)

Accessing Data at Multiple Sites
• Each site has its own naming convention for files
• A data grid provides a uniform way to name and access the files across the sites
(diagram: User Application; Archive at SDSC, Archive at NARA, Archive at U Md)

Building a Distributed Collection
• Logical name space
• Location independent identifier
• Persistent identifier
• Collection owned data
• Authenticity metadata
• Access controls
• Audit trails
• Checksums
• Descriptive metadata
• Inter-realm authentication
• Single sign-on system
Data Grid: common naming convention and set of attributes for describing digital entities
(diagram: User Application; Data Grid; Archive at SDSC, Archive at NARA, Archive at U Md)

Federated Server Architecture
(diagram: a Read Application submits a Logical Name Or Attribute Condition; SRB servers and SRB agents perform Peer-to-peer Brokering and Parallel Data Access against resources R1 and R2; Server(s) Spawning)
MCAT steps shown in the diagram:
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control

Managing Access
• Authenticate users independently of storage systems
    • Preservation environment owns the data
• Authorize data access independently of storage system
    • ACLs on both data and metadata
• Maintain audit trails of all accesses
    • Both read and write

Collection-owned Data
• Store data at remote storage system under data-grid ID
• Access data through data grid servers
• Track all operations on data and update state information
• User authenticates to a data grid server
• Access controls are checked for permissions
• Data grid servers authenticate messages from other servers
• Remote server authenticates to remote storage system
• Multiple authentication mechanisms
    • GSI / challenge-response / tickets

Provide Context for Data
• Properties of files
    • Provenance - source
    • Descriptive attributes
    • Structure
• Organize properties as metadata in a collection hierarchy
• Define operations on file properties
• Manage state information - location, replicas, containers
• Separate context management from content management
• Maintain consistency of context as operations are done on content

Database Operations
• Standard interface to support
    • Schema extension - user defined attributes
    • Snowflake table creation
    • SQL generation
    • Import and export of XML files
    • Bulk metadata load and unload
• Operations required to manage a catalog that resides in a database

National Archives and Records Administration Research Prototype Persistent Archive
Demonstrate preservation environment
• Authenticity
• Integrity
• Management of technology evolution
• Mitigation of risk of data loss
    • Replication of data
    • Federation of catalogs
• Management of preservation metadata
• Scalability
    • EAP collection
    • 350,000 files
    • 1.2 TBs in size

Federation of Three Independent Data Grids
• NARA MCAT: principal copy stored at NARA with complete metadata catalog
• U Md MCAT: replicated copy at U Md for improved access, load balancing and disaster recovery
• SDSC MCAT: deep archive at SDSC, no user access, but complete copy

Preservation Requirements
• Maintain authenticity and integrity of electronic records
    • Authenticity - assertion of provenance of data
    • Integrity - assertion of invariance of bits
• Manage risk of data loss
    • Media corruption / System failures / Operational errors / Natural disaster / Malicious users
• Manage technology obsolescence
    • Support migration of collection to new systems
    • Bulk data operations

Replication
• How many replicas are enough?

Federation
Data Access Methods (Web Browser, DSpace, OAI-PMH)

    Data Collection A: Data Grid             Data Collection B: Data Grid
    • Logical resource name space            • Logical resource name space
    • Logical user name space                • Logical user name space
    • Logical file name space                • Logical file name space
    • Logical context (metadata)             • Logical context (metadata)
    • Control/consistency constraints        • Control/consistency constraints

Access controls and consistency constraints on cross registration of digital entities

Data Grid Zones
• Choose how name spaces will be shared
• Cross register storage resources
    • May the other data grid write to my storage?
• Cross register user names
    • Users are authenticated by their home zone
• Cross register files
    • Can replicate files into another data grid
• Cross register metadata
    • Can build a copy of the metadata catalog

Federation Environments
(diagram: federation patterns arranged from Peer-to-Peer Data Grids to Hierarchical Data Grids, with consistency constraints labelling the transitions)
• Named environments: Free Floating, Occasional Interchange, Replicated Data, User and Data Replica, Resource Interaction, Replicated Catalog, Replication Data Grids, Nomadic, Snow Flake, Master Slave, Deep Archive
• Constraint labels: Partial User-ID Sharing; Replication Constraints; Partial Resource Sharing; No Metadata Synch; System Set Access Controls; System Controlled Complete Synch; Complete User-ID Sharing; Access Constraints; System Managed Replication; Connection From Any Zone; Complete Resource Sharing; Hierarchical Zone Organization; One Shared User-ID; System Controlled Partial Synch; No Resource Sharing; Super Administrator Zone Control; No User-ID Sharing

Examples of Extensibility
• Storage Repository Driver evolution
    • Initially supported Unix file system
    • Added archival access - UniTree, HPSS
    • Added FTP/HTTP
    • Added database blob access
    • Added database table interface
    • Added Windows file system
    • Added project archives - Dcache, Castor, ADS
    • Added Object Ring Buffer, Datascope
    • Adding GridFTP version 3.3
• Database management evolution
    • Postgres
    • DB2
    • Oracle
    • Informix
    • Sybase
    • mySQL (most difficult port - no locks, no views, limited SQL)

Examples of Extensibility
• The 3 fundamental APIs are C library, shell commands, Java
    • Other access mechanisms are ported on top of these interfaces
• API evolution
    • Initial access through C library, Unix shell command
    • Added inQ Windows browser (C++ library)
    • Added mySRB Web browser (C library and shell commands)
    • Added Java (Jargon)
    • Added Perl/Python load libraries (shell command)
    • Added WSDL (Java)
    • Added OAI-PMH, OpenDAP, DSpace digital library (Java)
    • Added Kepler actors for dataflow access (Java)
    • Adding GridFTP version 3.3 (C library)

Storage Resource Broker Collections at SDSC (2/22/2005)
Columns: GBs of data stored, Number of files, Number of Users

Data Grid
    NSF/ITR - National Virtual Observatory
    NSF - National Partnership for Advanced Computational Infrastructure
    Hayden Planetarium - Evolution of the Solar System visualizations
    Public collections - NSF/NPACI - Joint Center for Structural Genomics
    NSF/NPACI - Biology and Environmental collections
    NSF - TeraGrid, ENZO Cosmology simulations
    53,862 31,263 7,201 5,455 20,364 155,980
    9,536,751 6,435,338 113,600 3,405,266 52,159 1,157,168
    100 380 178 67 67 3,176
    NIH - Biomedical Informatics Research Network 9,830 6,632,159 241
    Miscellaneous static collections
Digital Library 8,013 161,352 241
    720 253 2,620 559 2,654 92
    99,010 45,365 8,892 53,048 71,318 1,052,202 2,387 2,074,138
    NLM - Digital Embryo image collection
    NSF/NPACI - Long Term Ecological Reserve
    NSF/NPACI - Grid Portal
    NIH - Alliance for Cell Signaling microarray data
    NSF - National Science Digital Library SIO Explorer collection
    NSF/NPACI - Transana education research video collection
    NSF/ITR - Southern California Earthquake Center
Persistent Archive
    NHPRC Persistent Archive Testbed (Kentucky, Ohio, Michigan, Minnesota)
    UCSD Libraries archive
    NARA - Research Prototype Persistent Archive
    NSF - National Science Digital Library persistent archive
    23 36 460 21 27 26 64 90
    4,147 991 3,572 372,947 408,050 455,094 26,918,638
    28 29 58 136
TOTAL 404 TB, 59 million files, 5,167 users

Sites Using the SRB
Academia Sinica, Taiwan
ASCC, Computing Centre, Taiwan
Australian National University
Bedford Oceanography, Canada
Bioinformatics Institute, Singapore
CSIRO, Australia
Data Storage Institute, Singapore
EGEE, French National Center
GeoForschungsZentrum, Germany
James Cook University, Australia
KEK High Energy Physics, Japan
Max Planck Institute, Netherlands
Parallab, Norway
South Australian Advanced Computing
UIB (Parallab), Norway
University of Amsterdam
University of Cambridge, Astronomy
University of Cambridge, e-Science
University of Edinburgh
University of Genoa, Italy
University of Hong Kong
University of Manchester
University of Oslo
University of Southampton
York Univ (UK)
CiteSeer, Penn State
City Univ. of New York
Geospatial Environment, UCSD
Drexel University
EOSDIS Distributed Active, NASA Goddard
Georgia Tech
Kentucky State Libraries & Archives
Library of Congress
Los Alamos National Lab
NASA Ames
NASA Goddard Space Flight Center
NCSA Grid Computing
NIH (NCI Center for Bioinformatics)
Penn State University
Pittsburgh Supercomputing Center
Purdue University, Indiana
Stanford University
TACC, University of Texas
Texas A & M
UC Santa Cruz
UCLA
UCSD Neuroscience
University of Maryland
University of Michigan, CAC department
University of New Mexico
University of Washington
University of Wisconsin
USC
Yale University

Research Areas
• Characterization of data / information / knowledge
• Preservation architecture
• Knowledge management - dynamic application of preservation policies
• Persistent object - characterization of digital entities

Characterizing Knowledge
• Data - bits that comprise a digital entity
• Information - a semantic label that is applied to data
• Knowledge - relationships between semantic labels
• Metadata - the combination of the semantic label and the data
• The creation of a semantic label is driven by the application of a process / relationship
    • Information is the result of applying knowledge relationships
    • Information is the reification of knowledge

Knowledge Management
• Reify relationships to improve access performance
    • Easier to query on metadata than to apply the original relationships
• Manage state information about the reification process - support for relationship changes
• Support levels of granularity for application of relationships - collective properties versus procedural properties
• Goal is to build a scalable knowledge management system

Preservation Strategies
• Emulation
    • Migrate the display application onto new operating systems
    • Equivalent to forcing use of candlelight to look at 16th century documents
• Transformative migration
    • Migrate the encoding format to the new standard
    • Migration period is expected to be 5-10 years
• Persistent object
    • Characterize the encoding format
    • Migrate the characterization forward in time

Persistent Objects
(timeline diagram, 1980 to 2020, with two tracks)
• Display Applications: characterize standard manipulation operations
• Digital Entities: characterize encoding format - data structure

Preservation Standards
• OAIS - Open Archival Information System
    • Submission Information Package (SIP)
    • Archival Information Package (AIP)
    • Dissemination Information Package (DIP)
• Producer Archive Interface Abstract Methodology Standard
    • (CCSDS Document 651.0-R-1)

Containers
• SRB provides support for aggregation of files into a container
• AIP is the aggregation of both preservation context and the records into a container
• What is the appropriate form for a self-describing container?

Self-instantiating Archive
• Preservation of Digital Data with Self-Validating, Self-Instantiating Knowledge-Based Archives, B. Ludäscher, R. Marciano, R. Moore, SIGMOD Record, ACM, 30(3), pp. 54-63, 2001.
• An archive consists of the application of archival processes to create the collection managed in the preservation environment
• Instantiation corresponds to the application of the archival processes to the original data

Example Web Crawl
• National Science Digital Library maintains registry of URLs for education material at Cornell
• Crawl sites
    • Recursion to a depth of 10 redirections
    • Restriction to pages within initial site plus one level outside site
• Store material on processing platform
    • 70,000 URLs - 2 million digital entities, 200 GB
    • On average 30 files per URL
    • Each file with average size 100 kBytes

Collection Requirements
• Provide containers for managing small files
    • 26 million files, average size 100 kB
    • Aggregate data in containers before storage
• Support web-based access to archived data
    • Redirect web page internal HTTP links to data grid handles
• Support integrity
    • Manage checksums on files

Accessioning Web Sites
• Use OAI harvesting to extract URLs from the NSDL repository
• Crawl each URL and process each digital entity
    • Replace internal URLs with data grid logical names
    • Aggregate digital entities into containers (files) for storage. Archives store files that are 40 MBytes in size.
• Generate archival context
    • Register digital entity into a data grid
    • Use collection hierarchy to associate web crawl properties with each file (date, site, initial URL, …)
• Write processed files into a storage system managed by a data grid
• Replicate data on Grid Bricks and archival storage system
• Provide OAI interface for reporting validation results

Persistent Archive Collections
• Build collections based on date crawled
• For each collection, use separate folder to hold digital entities associated with the original URL
    • Typically 30 digital entities per URL
    • Aggregate digital entities into containers before storage
• Preservation metadata maintained for each digital entity
    • Administrative, descriptive, structural, behavioral

A Few Statistics on NSDL Content
SDSC Crawl (April 03, 4 Links Deep)

    received correctly              1,530,206
    no data received                       51
    see other                               5
    forbidden                             311
    file not found                     38,386
    internal server error                 946
    application error                      15
    service temp. overloaded                8
    WIMS User Error                         1
    Gone                                    1
    unused                                  1
    redirection w/out location              1
    total digital entities          1,569,932
    error percentage                    2.53%

Encoding Formats Present in Archive

    Digital Entity Type    Number of files
    html                   331557
    gif                    157891
    jpg                    136445
    xml                     21528
    txt                     17433
    pdf                      9369
    css                      4073
    doc                       862
    asp                       819
    ppt                       161
    xls                        15

CSS - Cascading Style Sheet
ASP - Microsoft Active Server Page

Automated Processes: Categorizing the “Space” of all Descriptive Patterns
• Data-driven validation of descriptive metadata from NARA Archival Information Locator records
    • Exhaustive examination of every metadata occurrence
    • Automatic creation of an open-source relational database implementation
• Accumulation of all descriptive patterns
    • Based on deriving “Descriptive Signatures” relying on regular expressions
• Creation of a Perl-based Validation Regular Expression Tool
    • Refined regular expression to identify anomalies in the legacy metadata
    • Annotated artifacts introduced by archival processes

A String Analysis Approach
Accumulate all occurrence strings at each level of description in the hierarchy, and derive a regular expression that characterizes all instances:
• Record Group OR Collection (total of ~550)
• Series
• File Unit
• Item OR ItemAV (audio-visual)
___________________________________________
• Physical Occurrence
• Media Occurrence
• Object

A String Analysis Approach
Example - structural characterization:
• At the Series level, possible patterns are (S=Series, I=Item, O=Object, F=FileUnit):
    • SIOSIOOOOOO
    • SIO
    • SFFFF
    • SIIIIII
    • SIOIOOOOO
    • SFIIII
• An inferred regular expression is:
    • S( F*(I+O+)* | I+ )*
• Relational tables are derived from these regular expressions for each of the 9 levels

Metadata Validation
• Analyze each regular expression to identify the classes of anomalies
    • Cases in which a subset of the objects have a unique characterization different from the majority of the objects
    • Identify cases with incorrect metadata tags
    • Identify cases with missing metadata or missing objects
    • Identify changes in metadata definitions

Regular Expressions
• COLLECTION (2 characterizations):
• ***********
• = "TiMtldColid(XcXs)*(Date)?(Ab)?(Tcsd(Tcsdq)?Tced(Tcedq)?)?(Tisd(Tisdq)?Tied(Tiedq)?)?"
• = "(Odonor)?(Pdonor)*(Daut(Ndad)?)?(FatFan)?Dcgsd"
• RECORD GROUP (2 characterizations):
• *************
• = "TiMtldGrno(Date)?(Tcsd)?(Tcsdq)?(Tced)?(Tcedq)?Tisd(Tisdq)?Tied(Tiedq)?"
• = "(FatFan)*Dcgsd"

Regular Expressions
• SERIES (5 characterizations):
• *******
• = "Ti(Altti)*MtldS(Grno)?(Formerrg)*(Colid)?"
• = "(Acnum)*(Arra)?(Chn)?(Date)?(Funcu)?(Gen)*(Numb)?(Scale)?(Ab)?(Tran)?(Itn)?(Staff)?(Rctno)*(Dano)*(XcXs)+"
• = "((Tcsd)?(Tcsdq)?Tced(Tcedq)?)?(Tisd(Tisdq)?Tied(Tiedq)?)?"
• = "(Grt)*(Srt)*(OcontrOcontrtp)*(Orefer)*(Tgn)*(Lan)*(PcontrPcontrtp)*(Prefer)*(Subj)*((Ars)?(Sar)*(Arsn)?)"
• = "(Urrs)?(Surr)?(Urrn)?((Daut)?(Ndad)?)*(Fat(Fan)?)*(MpiMpt(Mpn)?)*(Taed)?(Tst)?(CrorgCrorgtp)?(CrindCrindtp)*(Dcgsd)"
• --> SarSar: Decision to combine them

Regular Expressions
• ITEM (4 characterizations):
• *****
• = "Ti(Altti)*Mtld(Grno)?(Formerrg)?(Colid)?(Acnum)?(Arra)?(Date)?(Gen)*(Ab)?(Staff)?(XcXs)+"
• = "((Tcsd)*(Tcsdq)?(Tced)*(Tcedq)?)?"
• = "(Tpd(Tpdq)?)*(Grt)+(Srt)*(OcontrOcontrtp)*(Orefer)*(Tgn)*(PcontrPcontrtp)*(Prefer)*(Subj)*(Ars)?(Sar)*(Arsn)?"
• = "Urrs(Surr)?(Urrn)?((Daut)?Ndad)*(MpiMpt(Mpn)?)*(CrorgCrorgtp)?Dcgsd"
• --> SarSar: Decision to combine them
• --> Tcsd/Tced: Regular Expression
• FILE UNIT (4 characterizations):
• **********
• = "Ti(Altti)?Mtld(Grno)?(Formerrg)?(Colid)?(Acnum)?(Arra)?(Gen)*(Ab)?(XcXs)+"
• = "((Tcsd)*(Tcsdq)?(Tced)*(Tcedq)?)?(Tisd(Tisdq)?Tied(Tiedq)?)?"
• = "(Grt)+(Srt)*(OcontrOcontrtp)?(Orefer)*(Tgn)*(PcontrPcontrtp)?(Prefer)*(Subj)*(Ars)?(Sar)*(Arsn)?"
• = "Urrs(Surr)?(Urrn)?((Daut)?Ndad)?(MpiMpt(Mpn)?)?(CrorgCrorgtp)?Dcgsd"
• --> SarSar: Decision to combine them
• --> Tcsd/Tced:

Lessons Learned
• Data-driven analysis of actual preservation metadata can be used to implement a new catalog on new technology
• Variant of self-instantiating archive, in which the preservation structure and catalogs are re-created

Preservation
• Archival processes through which a digital entity is extracted from its creation environment and migrated to a preservation environment, while maintaining authenticity and integrity information.
• Extraction process requires insertion of support infrastructure underneath the digital material, characterization of the authenticity and integrity, characterization of the digital encoding format, and characterization of the display operations
• Goal is infrastructure independence, the ability to use any commercial storage system, database, or access mechanism

For More Information
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.sdsc.edu/srb/
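
As a supplementary note to the String Analysis Approach slides, the structural check at the Series level can be sketched with Python's standard `re` module. The inferred expression and the six occurrence strings are copied from the slide; the helper names `signature` and `anomalies` are hypothetical, and the Perl-based tool mentioned in the slides is not reproduced here.

```python
import re

# Inferred Series-level expression, copied as written on the slide.
# re.VERBOSE tells Python to ignore the spaces inside the pattern.
SERIES = re.compile(r"S( F*(I+O+)* | I+ )*", re.VERBOSE)

# Occurrence strings listed on the slide (S=Series, I=Item, O=Object, F=FileUnit).
OBSERVED = ["SIOSIOOOOOO", "SIO", "SFFFF", "SIIIIII", "SIOIOOOOO", "SFIIII"]

# Hypothetical mapping from description level to its one-letter code.
CODES = {"Series": "S", "FileUnit": "F", "Item": "I", "Object": "O"}


def signature(levels):
    """Collapse a sequence of description levels into an occurrence string."""
    return "".join(CODES[name] for name in levels)


def anomalies(strings, pattern=SERIES):
    """Return the occurrence strings that the expression does not fully match."""
    return [s for s in strings if pattern.fullmatch(s) is None]


flagged = anomalies(OBSERVED)
```

With the strings as transcribed, `flagged` contains only `SIOSIOOOOOO`: it has a second `S`, which the expression as written accepts only at the start. Whether that reflects a transcription slip or two adjacent series is not settled by the slides, but it is the kind of case the Metadata Validation slide sets aside for review.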