Architecting Extensible Digital Repository Services

Robert Chavez, Robert Dockins, Anoop Kumar, Matthew Mcvey, Ranjani Saigal, Nikolai Schwertner
Tufts University, Medford, MA
Fedora Users Conference, Rutgers University, May 13, 2005

An Overview

- Digital Collections at Tufts
- Reasons for developing Tufts Digital Repository (TDR)
- Some design requirements and goals
- The TDR architecture and services
- Applications that interface with TDR
  - Tufts Digital Library
  - VUE
- Future Directions

A Brief History of Digital Collections at Tufts

- Pre-existing Digital Projects/Libraries/Collections
  - Perseus Digital Library
  - Tufts University Science Knowledgebase (TUSK-Medicine)
  - Artifact Image Library (Art History)
  - Miscellaneous projects: Crime and Punishment, Faculty Publications, Faculty Datasets
- Many and varied content management systems
- Digital Collections and Archives (DCA)
  - Steward of the University's permanently valuable digital records and collections
  - Many and varied digital collections
  - University records

Why TDR?

- Digital collections and materials are continually growing, adding content in a variety of formats.
- Original architectures and systems were not built to accommodate such expansion.
- Original architectures and systems were not built to facilitate interoperability or sharing of resources.
- Needed a university-wide digital repository that could manage the ever increasing content while continuing to service discipline specific needs and leveraging existing and new tools and services.
- Need for DCA to support digital data warehouse services and digital archival storage services for digital content of enduring value.

Who?

- Digital Collections and Archives (DCA) and Academic Technology (AT) partnered to create a digital repository and digital library application for managing content while supporting teaching and learning at the university.
- Roles (a bit over-simplified):
  - DCA: content developers, collection and deposit policy creators, managers of repository
  - AT: content developers, applications and overall system architects and developers

Design Requirements

- Persistence:
  - Enforce unique persistent identifiers
  - Manage identifiers for multiple projects
  - Assurance that the data will be preserved and retrievable over time
- Ingest:
  - Enforce archival standards
  - Ability to incorporate appraisal
  - Automated ingest workflow
- Management:
  - Use of information packages to facilitate storage and dissemination
  - Incorporate content models
  - Rights/access management
- Access/Interoperability:
  - Digital resources should be accessible to multiple applications and systems
  - Authorization policies must be enforced
- Scalability
- (Re)Usability
  - Leverage existing and new tools and services

Requirements and System Services

Requirement | System Service
Unique and persistent identification of materials | Naming Service
Adherence to the concept of Archival Information Packages (AIP) | Digital Object Provider (DOP) Service -- Fedora
Adherence to the concept of Submission Information Packages (SIP) | Drop Box, Ingestion Service
Adherence to the concept of Dissemination Information Packages (DIP) | DOP Service -- Fedora
Authentication and integrity checking | DOP Service, Ingestion Service
Dissemination | Disseminators, Caching Service, TDL, Search Service
Access | TDL and other applications

TDR Architecture

[Diagram. Components shown: Drop Box, Ingestion Service, Naming Service, Fedora Repository Service, Fedora Client, Caching Service, Interfacing Services, Application Interface, Indexing Service, Search Index, Search Service, Search Interface. Legend: P - Data Provider, A - Administrator, U - User. Arrows represent flow of data.]

Services of TDR

Component | Role
Drop Box and Ingestion Service | Validation, preprocessing, appraisal, transfer/deposit
Naming Service | Unique persistent identifiers (URNs) mapped to objects, management of URNs, management of repositories. Mapping from existing URN schemas to the Fedora schema
Fedora Repository Service | Management and access framework for digital objects
Indexing and Search Services | Metadata and full-text index creation. Search API and application
Bridge Services | Provides mechanisms for external applications to interface with repository

Current System Architecture

TDL Application

- How it all fits together, a working application
  - http://dl.tufts.edu

General TDL application search transaction process

[Diagram. Components shown between the user (U) and the repository:
- TDL App: Search Interface [JSP]
- Search Service: Oracle Query Builder [Java App.]
- Search Index: Main Index [Oracle]
- Search Service: Results Collation [Java App.]
- TDL App: Search Results [Search Interface]
- Search Index: XML index [Oracle]
- Naming Service: URN-PID resolution [MySQL]
- TDL App: Disseminator Viewer [JSP]
- Repository Service: Object Dissemination [Fedora]]

TDL Architecture

- Drop Box and Ingestion Service
- Naming Service
- Fedora Repository Service at Tufts
- Indexing and Search Services
- Interfacing Services

Drop Box and Ingestion Service

- Automate the process of preparing materials for ingest
- Validate materials before ingest
- Primarily for large-scale ingests
- Not an object factory (i.e., not a tool for building individual objects)

Naming Service

- Assigns, reserves and resolves URNs
  - The URN has a very flexible structure that can be tailored to the special needs of a particular naming convention.
  - Example: namespace1:namespace2:namespace3:object_id
- Manages repositories
  - Multiple production repositories, backup repositories, etc.

Tufts URN format examples

- tufts:dca:central:MS102:33.1345
- Perseus:text:1999.04.0006
- 97.5224.77-1729-47

URN Properties

- Provides a unique ID to objects deposited into the repository
- Service assures resolution to a unique resource.
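The assign/reserve/resolve cycle described for the Naming Service can be sketched as follows. This is only an illustration under stated assumptions: the class and method names (`Urn`, `NamingService`, `reserve`, `assign`, `resolve`) are hypothetical, an in-memory dictionary stands in for the MySQL table mentioned in the slides, and the PID `demo:1` is made up. It is not the Tufts implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Urn:
    """A colon-delimited URN: zero or more namespaces, then an object id."""
    namespaces: tuple
    object_id: str

    @classmethod
    def parse(cls, text):
        parts = text.strip().split(":")
        if any(part == "" for part in parts):
            raise ValueError(f"empty component in URN: {text!r}")
        # Everything before the last colon is treated as nested namespaces.
        return cls(namespaces=tuple(parts[:-1]), object_id=parts[-1])

    def __str__(self):
        return ":".join(self.namespaces + (self.object_id,))


class NamingService:
    """Hypothetical in-memory stand-in for a URN-to-PID table."""

    def __init__(self):
        self._pids = {}  # URN string -> repository PID, or None while reserved

    def reserve(self, urn_text):
        urn = str(Urn.parse(urn_text))
        if urn in self._pids:
            raise KeyError(f"URN already in use: {urn}")
        self._pids[urn] = None
        return urn

    def assign(self, urn_text, pid):
        urn = str(Urn.parse(urn_text))
        if self._pids.get(urn) is not None:
            raise KeyError(f"URN already assigned: {urn}")
        self._pids[urn] = pid

    def resolve(self, urn_text):
        pid = self._pids.get(str(Urn.parse(urn_text)))
        if pid is None:
            raise LookupError(f"no object assigned to URN: {urn_text}")
        return pid


service = NamingService()
service.reserve("Perseus:text:1999.04.0006")
service.assign("Perseus:text:1999.04.0006", "demo:1")  # "demo:1" is a made-up PID
resolved_pid = service.resolve("Perseus:text:1999.04.0006")
```

Keeping the URN as an opaque colon-delimited string, with only the last component treated as the object id, is one way to accommodate the differing project schemas shown above; a real service would also need persistence and per-repository mappings.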
- Implementation
  - MySQL, Java class, JSP
- Management console

Tufts Naming Service

Fedora Repository Service

Fedora met many of our critical needs:

- Modular nature of the repository service
- Management of digital content over time (versioning, etc.)
- Aggregation of mixed, possibly distributed, data into complex objects
- The ability to specify multiple content disseminations of these objects
- The ability to associate rights management schemes with these disseminations

Fedora Repository Service, cont…

- Tufts Implementation Details:
  - External data stores
  - Modeling behaviors and content
  - Piece of a larger architecture; not an out-of-the-box solution

Tufts Repository Models/Policies

Fedora @ Tufts serves several purposes:

- Archival/institutional repository
  - Guarantee functional preservation
- Data warehouse
  - Guarantee bitstream preservation
- Active repository
  - Active workspace; constantly updated content (e.g., faculty data sets, faculty pubs, content mapping)

Behavior Definitions

- Atomic units: sets of standardized behaviors
- Building blocks of content models
- Allow for flexible reuse of data
- Contribute to inter-repository sharing of objects
- Dissemination of standard output: XML, plain text, binary format
- Rendering/processing of disseminations is the responsibility of applications implemented over the repository.

BDefs | Methods
tuftsAssetDef | getPreview, getLabel, getDescription, getFullView, getDefaultContent, getDescMetadata, getAdminMetadata
tuftsText | getTOC, getChuckList, getChunk, getHeader
tuftsBasicImage | getThumbnail, getScreensize, getMaxSize, getDynamicView

Content Models

- Unique content models built from content modeling components.
- Digital objects that subscribe to a given content model inherit all methods established by a particular behavior.
- Digital objects can subscribe to content models that suit their type or class.
- Functional, not presentation specific

Implementation Challenges

- Processing large (>10MB) XML documents
  - XML databases
- Processing large images
  - Imaging servers
- Streaming media
- GIS data
- Modeling collections
- Advanced searching
- "Shopping cart" searching
- Caching disseminations

Indexing and Search Service

- Indexing
  - Digital objects piped through from ingestion service
  - Metadata index
  - Full-text index
  - Specialized XML index
- Implementation
  - Java indexing application
  - Oracle database
- Supported types of search
  - Basic full-text
  - Basic metadata
  - Advanced metadata
- Accessing the service
  - HTTP GET/POST
  - SOAP

Interfacing Services

- An important design requirement for TDR was to let current digital library applications interface easily with TDR and give seamless access to repository content from within their own environments.
- Current applications like VUE can interface with this service to allow their tools to disseminate the content that resides in TDL.
- The service is being designed not only to support current applications but also to accommodate the needs of future, yet-to-be-defined applications such as course management systems, learning tools, and portals.
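As a rough illustration of how an external application might request a dissemination from the repository, the sketch below builds a URL of the form `<base>/get/<object PID>/<behavior definition PID>/<method>`. That shape is inferred from the collection example URL that appears in this deck; the function name `dissemination_url` is hypothetical, and whether every Tufts disseminator is exposed this way is an assumption.

```python
from urllib.parse import quote


def dissemination_url(base_url, pid, bdef_pid, method):
    """Build <base>/get/<pid>/<bdef_pid>/<method>, leaving ':' in PIDs readable."""
    # Percent-encode each path segment separately; ':' is kept because PIDs use it.
    segments = [quote(part, safe=":") for part in (pid, bdef_pid, method)]
    return "/".join([base_url.rstrip("/"), "get"] + segments)


# Values taken from the collection example URL in this deck.
example_url = dissemination_url(
    "http://nikolai.tccs.tufts.edu:1980/fedora",
    "demo:collectionAll",
    "demo:Collection",
    "viewMembers",
)
```

An application such as TDL would presumably resolve a URN to a PID first and only then build a request like this, so that the repository's internal identifiers stay hidden from end users.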
Fedora OKI Bridge

Fedora | OKI
PID | Shared.Id
DR | DigitalRepository
FedoraObject | Asset
FedoraObjectIterator | AssetIterator
BehaviorInfoStructure | InfoStructure
Behavior | InfoRecord
DisseminationInfoPart | InfoPart
Dissemination | InfoField
ParameterInfoPart | InfoPart
Parameter | InfoField
DataOutputStreamInfoPart | InfoPart
DataOutputStream(MIMETypeStream) | InfoField

Applications Accessing TDR Content

- Tufts Digital Library Application
  - http://dl.tufts.edu/
- Visual Understanding Environment (VUE)
  - http://vue.tccs.tufts.edu/

VUE Overview

- Learning Theories
  - Constructivism
  - Active Learning
  - Individualized Learning
- Support
  - Faculty needs
  - Learners' needs
- Extend
  - Digital Libraries
  - OKI Standards

Technical Infrastructure: OKI-FEDORA Bridge

[Diagram. Components shown: VUE, OKI DR API, DR Implementations, FEDORA Digital Repository, Digital Repository.]

Future Directions

- Revised search service (Zebra?)
- XML database for metadata and XML objects (eXist)
- Customization and enhancement to address a wide variety of needs (e.g., University Records)
- Object factory: a workbench for building certain classes of objects
- Automated browsing service for the repository
- Authentication and authorization modules
- Asset Definitions
- Collection Modeling
- Federation

Asset Definitions

- The purpose of the Fedora Asset Definition is to define and expose content types and methods of objects/assets in a repository in a standard way.
- The goal is to facilitate access between applications and digital repositories, between digital repositories themselves, etc.
- Some of the questions that we asked ourselves during our repository and application development helped us form the concept of an "Asset Definition." For example:
  - How can an application find out what objects/assets a particular repository holds, and how does it refer to them?
  - If one has an object/asset in a repository, how does one describe it so that other applications can understand what they can do with it?
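The two questions above amount to asking for a machine-readable description of what each object can do. A minimal sketch of that idea follows, using the behavior definition and method names copied verbatim from the BDefs table earlier in this deck. The content model names (`image`, `text`), the object PIDs, and the functions `describe_asset` and `supports` are all hypothetical, and the dictionaries stand in for whatever the repository would actually expose.

```python
# Behavior definitions and their methods, copied from the BDefs table in this deck.
BDEFS = {
    "tuftsAssetDef": ["getPreview", "getLabel", "getDescription", "getFullView",
                      "getDefaultContent", "getDescMetadata", "getAdminMetadata"],
    "tuftsText": ["getTOC", "getChuckList", "getChunk", "getHeader"],
    "tuftsBasicImage": ["getThumbnail", "getScreensize", "getMaxSize",
                        "getDynamicView"],
}

# Hypothetical content models, each composed from behavior definitions.
CONTENT_MODELS = {
    "image": ["tuftsAssetDef", "tuftsBasicImage"],
    "text": ["tuftsAssetDef", "tuftsText"],
}

# Hypothetical objects and the content model each one subscribes to.
OBJECTS = {"demo:100": "image", "demo:200": "text"}


def describe_asset(pid):
    """Return the content model and callable methods for one object."""
    model = OBJECTS[pid]
    methods = {bdef: list(BDEFS[bdef]) for bdef in CONTENT_MODELS[model]}
    return {"pid": pid, "content_model": model, "methods": methods}


def supports(pid, method):
    """True if any behavior definition of the object's model offers `method`."""
    description = describe_asset(pid)
    return any(method in names for names in description["methods"].values())


image_description = describe_asset("demo:100")
```

With a description like this, a client could check `supports("demo:100", "getThumbnail")` before attempting a dissemination instead of hard-coding knowledge of each object type.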
Asset Definitions, cont…

- getFullAssetDefintion
- getPreview
- getDescription
- getFullView
- getDefaultContent
- getDescMetadata
- getAdminMetadata
- getThumbnail
- getScreenSize
- getMaxSize
- getDynamicView

Collection Modeling

- Object Relationships
  - Extend Fedora RDF to create collection networks
  - Recursive disseminators to track paths in the network
  - Facilitate access to sets of materials
  - Facilitate management of digital objects
  - Facilitate browsing of sets of materials
  - http://nikolai.tccs.tufts.edu:1980/fedora/get/demo:collectionAll/demo:Collection/viewMembers/
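The idea of recursively tracking paths through a collection network can be illustrated with a small traversal. This is a sketch under assumptions, not the disseminator's actual logic: the function `collect_members` is hypothetical, the membership dictionary stands in for relationships that would be stored as RDF, and every PID other than `demo:collectionAll` is made up.

```python
def collect_members(collection, members_of, seen=None):
    """Return every non-collection object reachable from `collection`, depth first.

    `members_of` maps a collection PID to the PIDs it directly contains.
    Any PID that is not a key of `members_of` is treated as a leaf object.
    """
    seen = set() if seen is None else seen
    if collection in seen:  # already visited: guards against cycles
        return []
    seen.add(collection)
    found = []
    for child in members_of.get(collection, []):
        if child in members_of:  # child is itself a collection
            found.extend(collect_members(child, members_of, seen))
        elif child not in seen:  # report each leaf object once
            seen.add(child)
            found.append(child)
    return found


# Hypothetical network; collectionA points back to its parent to show the guard.
network = {
    "demo:collectionAll": ["demo:collectionA", "demo:collectionB"],
    "demo:collectionA": ["demo:obj1", "demo:obj2", "demo:collectionAll"],
    "demo:collectionB": ["demo:obj2", "demo:obj3", "demo:collectionA"],
}
all_members = collect_members("demo:collectionAll", network)
```

Because collections may contain other collections and an object may belong to several of them, the visited set both prevents infinite recursion and keeps shared members from being listed twice.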