Releasing Open Source at the Library of Congress Repository Development Center Office of Strategic Initiatives Leslie Johnston 2009 LITA Forum.



STARTING DOWN A PATH TOWARDS BETTER CONTROL

What are our most basic needs? What is the first step?

• How do we know what we have, where it is, and who it belongs to?
• How do we get files – new and legacy – from where they are to where they need to be?

Repository Development Center

/ Office of Strategic Initiatives p.2

IDENTIFYING THE TRANSFER PROBLEM SPACE

As part of its first phase repository development, the Library of Congress is working on solutions for a category of activities that we refer to as “Transfer.” At a high level, we define transfer as including the following human- and machine-performed tasks:
• Adding digital content to the collections, whether from an external partner or created at LC;
• Moving digital content between storage systems (external and internal);
• Review of digital files for fixity, quality and/or authoritativeness; and
• Inventorying and recording transfer life cycle events for digital files.


RECENT TRANSFER EXPERIENCE

During 2008 the Library of Congress received:
• 30 TB from NDIIPP preservation partners,
• 20 TB in Web Capture crawls to preserve identified web sites,
• 30 TB from National Digital Newspaper Program (NDNP) partners, and
• 1 TB from World Digital Library partners.

Individual transfers retrieved over the network ranged from 20 MB to over 2 TB.

Dozens of hard drives with licensed, partner and vendor supplied content.

All forms of content, some to be dark archived for preservation, some limited to Library use, and some to be made publicly available.

There is also newly internally digitized content that has to be managed.


DEVELOP A STANDARD AND TOOLS TO OPTIMIZE TRANSFERS

BagIt: A Packaging Specification for File Transfers

A packaging specification for file transfers. Supports minimally self-identifying and self-describing packages with support for error detection and transfer optimization.

Motivating use cases:
• Transfer of content internally and between preservation partners.
• Long-term storage of content.

Needs:
• Minimally self-identifying and self-describing packages.
• Support for error detection and transfer optimization.

Characteristics:
• Low overhead.
• Content-type agnostic.
• Supported by off-the-shelf, easily supported tools.

http://www.digitalpreservation.gov/library/resources/tools/docs/bagitspec.pdf


WHAT’S IN A BAG?

• /data directory with contents
• Package description: bag-info.txt
• Manifest of contents with checksums
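The three parts named above can be illustrated with a small Python sketch. It is not the Library's tooling: the function name make_bag, the choice of SHA-256, and the version string written to bagit.txt are assumptions made for illustration. The bagit.txt declaration is not mentioned on the slide but is required by the BagIt specification.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes], info: dict[str, str]) -> None:
    """Write a minimal bag: data/ payload, bag-info.txt, and a checksum manifest.

    Hypothetical helper for illustration; not part of any Library of Congress tool.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_name, content in sorted(payload.items()):
        target = data_dir / rel_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # One manifest line per payload file: checksum, whitespace, relative path.
        manifest_lines.append(f"{digest}  data/{rel_name}")
    # Bag declaration; the version value here is an assumption.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    # Package description as simple "Label: value" lines.
    (bag_dir / "bag-info.txt").write_text(
        "".join(f"{label}: {value}\n" for label, value in info.items()),
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example-bag"
        make_bag(bag, {"letters/page1.txt": b"hello"},
                 {"Source-Organization": "Example Library"})
        print(sorted(p.name for p in bag.iterdir()))
```

The payload stays untouched under data/, and everything else is plain text that can be read or checked with ordinary tools, which is consistent with the low-overhead goal stated earlier.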


TRANSFER TOOL DEVELOPMENT

To promote the use of BagIt in the Library and outside, tools were required to make the specification easy to use.

• Parallel Retriever script: efficient package transfer.
• Validation script: validates Bags against the BagIt specification.
• VerifyIt script: verifies that files are uncorrupted.
• BagIt Java Library (BIL): used for application and command line tool development.
• Bagger desktop application: graphical desktop tool to create/update/validate Bags.
• LocDrop web application: supports partner registration of transfers, whether shipping a hard drive or sending over the network.
• Inventory System: records lifecycle events for packages of Bags and files.
• Workflow Tools
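To make the idea of verifying that files are uncorrupted concrete, here is a minimal Python sketch of a manifest check. It is not the VerifyIt script itself: the function name verify_bag, the default algorithm, and the wording of the problem messages are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir: Path, algorithm: str = "sha256") -> list[str]:
    """Return a list of problems; an empty list means every listed file matched.

    Hypothetical helper: recomputes each checksum listed in the manifest.
    """
    problems = []
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        digest = hashlib.new(algorithm)
        with target.open("rb") as handle:
            # Read in 1 MiB chunks so large files do not need to fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

A fuller validation step would also check the bag's structure, for example that no payload file is absent from the manifest; this sketch covers only the fixity comparison.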


TRANSFER TOOL DEVELOPMENT: BAGGER

Bagger Graphical Bag Authoring Tool
• Allows users to create generic Bags or Bags that meet specified project profiles.
• Provides project-specific templates that enforce project Bag descriptive metadata requirements.
• Built on top of the BagIt Java Library.
• Presents a range of options for compressed serialization and complete versus “holey” bags.
• Java Webstart version automatically checks for the most recent version to keep the tool updated.
• Standalone version is bundled with all necessary software and runs without requiring installation privileges.
• Runs on a PC or Mac.
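In the BagIt specification, a “holey” bag is one whose payload is not fully present; the missing files are listed in a fetch.txt file, one per line, as a URL, a length in bytes (or "-" when unknown), and the destination path inside the bag. The Python sketch below parses such a file. The names FetchEntry and parse_fetch and the example.org URLs are made up for illustration; this is not how Bagger or BIL implement it.

```python
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when fetch.txt gives the length as "-"
    path: str


def parse_fetch(text: str) -> list[FetchEntry]:
    """Parse fetch.txt content into entries (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Three whitespace-separated fields: URL, length, path within the bag.
        url, length, path = line.split(maxsplit=2)
        entries.append(FetchEntry(url, None if length == "-" else int(length), path))
    return entries


if __name__ == "__main__":
    sample = (
        "http://example.org/scans/a.tif 1024 data/a.tif\n"
        "http://example.org/scans/b.tif - data/b.tif\n"
    )
    for entry in parse_fetch(sample):
        print(entry)
```

A retrieval tool could hand these entries to several workers at once and then run the usual checksum verification once every file has arrived.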


USING BAGGER

Add files to the /data directory, create and select a profile, and enter bag-info metadata
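As a rough illustration of what enforcing a project profile could mean, the Python sketch below checks bag-info.txt text against a list of required labels. The function name missing_fields and the idea of a profile as a plain list of labels are assumptions; Bagger's actual profile format is not described in these slides.

```python
def missing_fields(bag_info_text: str, required: list[str]) -> list[str]:
    """Return required labels that are absent or empty in bag-info.txt text.

    Hypothetical helper: treats a profile as a plain list of required labels.
    """
    present = set()
    for line in bag_info_text.splitlines():
        # Lines starting with whitespace continue the previous value; skip them.
        if ":" in line and not line[0].isspace():
            label, value = line.split(":", 1)
            if value.strip():
                present.add(label.strip().lower())
    return [label for label in required if label.lower() not in present]


if __name__ == "__main__":
    text = "Source-Organization: Example Library\nContact-Name:\n"
    print(missing_fields(text, ["Source-Organization", "Contact-Name"]))
```

A graphical tool can run a check like this before it allows a bag to be saved, so that incomplete metadata is caught at authoring time rather than on receipt.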


USING BAGGER

Completed bag with generated manifest


LOCDROP TOOL DEVELOPMENT

• LocDrop is designed to support notification for transfers of content into the Library of Congress both from outside the Library and within the Library itself. The application currently lets you register network and physical media transfers (hard drives, CDs, DVDs, etc.) that the Library will retrieve. In later versions we expect to add the ability to launch network transfers directly.
• LocDrop will simplify the processes to track content we expect to receive. Over time, we expect to connect this application to related services that will continually improve how we manage the transfer and receipt of materials from all sources.


USING LOCDROP

Register the information needed to track data shipments to and from the Library


USING LOCDROP

Register the information needed for the Library to retrieve network transfers


INVENTORY TOOL DEVELOPMENT

• Record Package Events: examples of Package Events include “Package Received Events,” which are recorded when a project receives a package; and “Package Accepted Events,” which are recorded when a project accepts curatorial responsibility for a package.
• Record File Events: examples of File Events include “File Copy Events,” which are recorded when a package is copied from one File Location to another; and “Quality Review Events,” which are recorded when quality review is performed.
• For legacy collections the Inventory Tool can be pointed at existing file systems and directories to package, checksum, and record life cycle events to bring the files under initial control.
• The Inventory Tool is implemented on top of BIL, our BagIt Java Library.
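The event-recording idea can be pictured as an append-only log keyed by package. The Python sketch below is only an illustration under that assumption: the class names Event and Inventory, the field names, and the event kind strings are all hypothetical and do not describe the Inventory Tool's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    kind: str        # e.g. "package_received", "file_copy", "quality_review"
    package_id: str
    detail: str = ""
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class Inventory:
    """Append-only record of life cycle events (hypothetical model)."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def record(self, kind: str, package_id: str, detail: str = "") -> Event:
        event = Event(kind, package_id, detail)
        self._events.append(event)
        return event

    def history(self, package_id: str) -> list[Event]:
        # Events are returned in the order they were recorded.
        return [e for e in self._events if e.package_id == package_id]


if __name__ == "__main__":
    inventory = Inventory()
    inventory.record("package_received", "pkg-1")
    inventory.record("file_copy", "pkg-1", "staging to archival storage")
    for event in inventory.history("pkg-1"):
        print(event.kind, event.detail)
```

Keeping events immutable and appending new ones, rather than updating a status field, preserves the full history that audits and project reports draw on.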


USING THE INVENTORY TOOL

Running an Inventory operation


USING THE INVENTORY TOOL

Searching the Inventory, plus auditing, file count, space usage, and project-specific Inventory reports


WORKFLOW DEVELOPMENT

• The Transfer components and Inventory Tool are tied together through multiple project-based Workflow systems.
• Through case study development with stakeholders we identify the data flow and tasks to be performed.
• Workflow tasks formalized through the system include transfer, validation by a format validation application, manual quality review inspection, and file copying to archival storage and production storage.
• A workflow UI allows users to initiate, monitor, and administer processes, and to notify the workflow engine of the outcome of manual tasks, including task completion.
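The interplay between automated tasks and manual tasks reported through a UI can be sketched as a small state machine. The Python below is an assumption-laden illustration, not the Library's workflow engine: the Workflow class, its state names, and the convention that a manual task has no action are all hypothetical.

```python
from typing import Callable, Optional

# A step is a (name, action) pair; action is None for a manual task.
Step = tuple[str, Optional[Callable[[], bool]]]


class Workflow:
    """Runs automated steps in order and pauses at manual ones (hypothetical)."""

    def __init__(self, steps: list[Step]) -> None:
        self.steps = steps
        self.position = 0
        self.state = "running"  # running | waiting | failed | completed
        self._advance()

    @property
    def current_task(self) -> Optional[str]:
        if self.position < len(self.steps):
            return self.steps[self.position][0]
        return None

    def _advance(self) -> None:
        while self.state == "running" and self.position < len(self.steps):
            _, action = self.steps[self.position]
            if action is None:
                self.state = "waiting"   # pause until a person reports an outcome
            elif action():
                self.position += 1       # automated task succeeded
            else:
                self.state = "failed"
        if self.state == "running":
            self.state = "completed"

    def notify(self, success: bool) -> None:
        """Report the outcome of the pending manual task, as a UI would."""
        if self.state != "waiting":
            raise RuntimeError("no manual task is pending")
        if success:
            self.position += 1
            self.state = "running"
            self._advance()
        else:
            self.state = "failed"
```

For example, a workflow built from automated transfer and validation steps, a manual quality review, and an automated copy to archival storage would stop in the waiting state at the review and finish only after notify(True) is called.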


RUNNING A WORKFLOW

Starting, searching, and monitoring workflows


RUNNING A WORKFLOW

Updating an in-progress workflow


INITIATING THE OPEN SOURCE RELEASE

• It was decided that the three utility scripts – the key tools needed for the movement and validation of Bagged content – should be the first candidates for open source release.
• The scripts were submitted to the Office of General Counsel at the Library for review. This review included close scrutiny by the attorneys in the office for everything from purpose (automating a process) to originality (determining that no code came from any other licensed sources) to authorship (Library staff versus Library contractors).
• Due to some contractual obligations with a contracting company which prohibited straightforward public domain release, the three scripts were released on SourceForge in December 2008 under a BSD license.

http://sourceforge.net/projects/loc-xferutils/


CONTINUING THE OPEN SOURCE RELEASE

The next vital release had to be BIL (the BagIt Library), a Java library developed to support Bag services.

A barrier to uptake of the BagIt specification was the lack of a way to automate the Bagging process and to support the development of tools. BIL supports key functionality such as creating, manipulating, validating, and verifying Bags, as well as the uploading of Bags using the SWORD deposit protocol.

The review of BIL for open source release by the Office of General Counsel was a more complex affair. There was a single author who was a Library staff member, but there were thirteen bundled dependencies each with their own licenses to be reviewed. BIL was released into the public domain with the understanding that those licenses restricted any bundling of BIL and its dependencies into new tools by others, but in no way restricted the release.

BIL was released as both compiled and source code in June 2009.


MANAGING THE RELEASE

At the time of both releases the Library made a conscious decision to just release the code, and not take advantage of the SourceForge functionality that supports the committing of code back into the project.
• These were three relatively simple scripts and it seemed to make the most sense to release them and let others work with them or use them to model their own development.
• No one was available at the time who could devote the effort needed to manage a full-blown open source project.

The scripts can be updated by anyone in the community for their use. The Library has committed to releasing its updates to BIL. Updates to the source code are expected and welcome through the Digital Curation group.


UPCOMING RELEASES

The Bagger application is nearing the completion of its development and partner testing. Bagger is meant to provide a graphical desktop tool for the Bagging of content, ideally requiring no client-side IT support or infrastructure.

It is implemented as a Java Web Start application for use across platforms, as well as a standalone version with its own bundled, stripped-down JRE, and supports the aggregation of files into Bag packages, including the creation of checksum manifests and Bag information files. It is developed on top of BIL.

The Bagger review includes the proposed release of three variants – the Java Web Start version, and standalone versions for the PC and Mac – as well as the source code. The review encompasses a number of bundled dependencies, including the redistribution license for Java.


BUILDING A COMMUNITY

     The BagIt specification was posted on the Library of Congress and California Digital Library sites and as an Internet “Request for Comment” (RFC).

The BagIt specification will also be released on SourceForge to promote wider dissemination, discussion, and community building.

BagIt and the tools have been promoted to partners from three different initiatives, blogged, tweeted, shared on Facebook, presented at conferences, described in the Library’s Digital Preservation Newsletter, described in email sent to listservs, discussed in a Google group, and written up in journal articles.

The team launched a Digital Curation Google group in part to support the activities of this increasingly participatory community and encourage open, public discussion.

http://groups.google.com/group/digital-curation

The best strategy for building a community was the use of BagIt by the NDIIPP partner institutions. NDIIPP strongly encouraged partners to “bag” their content for their preservation transfers to the Library.


BUILDING A COMMUNITY

The Library moved into new modes of promotion and community building, including development of an introductory video featuring Brian Vargas, one of the authors of the specification.

http://www.digitalpreservation.gov/videos/bagit0609.html


SUCCESSES FOR THE RELEASE

How is the success of this initiative measured?
• There have been close to 300 downloads from the SourceForge site.
• The Google group has over 120 participants.
• A significant percentage of the 130 NDIIPP partners have utilized the BagIt specification in their preservation transfers to the Library.
• The Library recently became aware of the open source Ruby BagIt, a Ruby Gem released in early 2009 to support use of the specification.

http://rubyforge.org/projects/bagit/


OUTCOMES FOR THE LIBRARY

• The Library's first Open Source software release: http://sourceforge.net/projects/loc-xferutils/
• BagIt is in use with multiple NDIIPP partners, in the eDeposit pilot project, and for the packaging and transport of file packages internally.
• Gradual development of graphical workflow tools for all active projects.
• The transfer of partner content has informed the Library’s own preservation efforts, building our understanding about what we need to know about files and what events in their life cycle we need to record and track.
• The Inventory Tool will support the Library's initial efforts in a file-level preservation audit.
• Put all tools and services into full production during 2009.


Questions?

Leslie Johnston

[email protected]
