Data catalogues and the data repository ADMIRe JISC MRD Dr Tom Parsons March 2013 Friday, November 06, 2015 ADMIRe.

Data catalogues and the data repository
ADMIRe JISC MRD
Dr Tom Parsons
March 2013
Friday, November 06, 2015
ADMIRe
1
A world-class university
• One of the world’s top 100 universities, Nottingham is
recognised globally for ground-breaking research and
teaching excellence.
• 40,000 students from more than 150 countries, two
overseas campuses and strong links with universities
around the world
• Heavily focused on research: Medical & Health Sciences,
Sciences, Engineering, Social Sciences and Arts
• Large research income (£100m) – primarily RCUK, UK/EU
government, commercial and charities
RDM policy
“1.5. The University will provide mechanisms and services for
storage, backup, registration, deposit, retention and
preservation of research data assets in support of current
and future access, during and after completion of research
projects.”
• Key priorities for ADMIRe:
– Is the current provision good enough?
– Where are the gaps?
– What do we need to provide?
Understanding requirements
• Approaches:
– Survey (summer 2012)
– Focus groups (November 2012)
– Interviews (May 2012 onwards)
• Mixture of ADMIRe, in-house, JISC MRD & Sero
• Outputs: service model, detailed requirements catalogue,
logical models & prototype
• Institutional requirements: “Enterprise Architecture
compliant”, use and integrate with existing systems
Survey results: Types of data
Survey results: Data storage
Survey results: Metadata…
Sharing data?
Survey results: Total research data estimates
• From the survey’s 366 responses
• 75 GB average (mean/frequency)
Total research data estimates
• 75 GB average × approx. number of PIs & post-grads (4,000) =
300 TB (±90%)
• Large number of unknowns
• A large amount of data, a large number of files and a good case
for managing it
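The estimate above can be reproduced with a short calculation. This is only a back-of-envelope sketch using the slide's own figures; the use of decimal terabytes (1 TB = 1,000 GB) and the function name `estimate_tb` are assumptions made for illustration.

```python
# Figures taken from the slide.
MEAN_GB_PER_RESEARCHER = 75   # survey mean
RESEARCHERS = 4_000           # approx. number of PIs and post-grads
UNCERTAINTY = 0.90            # the slide's +/-90%

GB_PER_TB = 1_000             # assumption: decimal terabytes


def estimate_tb(mean_gb, headcount, uncertainty):
    """Return (low, central, high) total data volume in TB."""
    central = mean_gb * headcount / GB_PER_TB
    return central * (1 - uncertainty), central, central * (1 + uncertainty)


low, central, high = estimate_tb(MEAN_GB_PER_RESEARCHER, RESEARCHERS, UNCERTAINTY)
print(f"{central:.0f} TB (range {low:.0f} to {high:.0f} TB)")
```

With an uncertainty this wide, the range spans more than an order of magnitude, which is consistent with the "large number of unknowns" noted on the slide.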
Focus groups to understand more
• Five Faculty-based focus groups (30 people in total)
• Based upon California Digital Library model
Active data
Archive data
Preservation activities
Actors: R, S, A

Function         Description                                               Req.  Freq.
1 – Tag          Enter metadata describing a bag of research data assets   M     M
2 – Bag          Zip the data files up in a bag                            C     M
3 – Transfer     Transfer a bag to archival storage                        C     M
4 – Ingest       Ingest a bag into storage                                 C     M
5 – Update       Update (enhance, correct) metadata for a stored bag       O     L
6 – GetDOI       Get (public, private) DOIs for designated assets          C     L
7 – Publish      Publish assets appropriately on landing pages             C     L
8 – Relocate     Relocate assets and update locators                       O     L
9 – Search       Search for assets by keyword or field                     M     H
10 – Access      Access metadata and data according to permissions         M     M
11 – Notify      Notify actors automatically about data events             O     P
12 – Annotate    Create notes about a bag or its contents                  O     L
13 – Check       Check (verify) that the contents of a bag are in order    M     P
14 – Report      Run reports on aspects of the system (DOI, bag, user)     O     L
15 – Administer  Administer permissions and system parameters              M     M
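The Bag and Check functions listed above can be sketched in a few lines. This is a simplified illustration, not the Baggit tool itself and not a conforming BagIt bag: the helper names `make_bag` and `check_bag`, the choice of SHA-256, and the flat zip layout are all assumptions made for this example.

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Dict, List

# Manifest file name chosen for this sketch.
MANIFEST = "manifest-sha256.txt"


def make_bag(files: Dict[str, bytes], bag_path: Path) -> None:
    """Bag: zip the data files together with a checksum manifest."""
    lines = [
        f"{hashlib.sha256(data).hexdigest()}  {name}"
        for name, data in sorted(files.items())
    ]
    with zipfile.ZipFile(bag_path, "w", zipfile.ZIP_DEFLATED) as bag:
        for name, data in files.items():
            bag.writestr(name, data)
        bag.writestr(MANIFEST, "\n".join(lines) + "\n")


def check_bag(bag_path: Path) -> List[str]:
    """Check: return names of files that are missing or fail their checksum."""
    problems = []
    with zipfile.ZipFile(bag_path) as bag:
        names = set(bag.namelist())
        for line in bag.read(MANIFEST).decode().splitlines():
            digest, name = line.split("  ", 1)
            if name not in names:
                problems.append(name)
            elif hashlib.sha256(bag.read(name)).hexdigest() != digest:
                problems.append(name)
    return problems
```

An empty list from `check_bag` means every file named in the manifest is present and unchanged, which is the kind of verification a periodic Check run would rely on.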
Mapping requirements
Where are we now?
Solution: Data Retention Platform
  Description: A storage platform that enables storage of “unstructured” data files.
  Scope: Storage of files and very basic metadata (file type, size, retention period, user)
  Interfaces/Integrations: AD to support access. (Note that Open Access will be supported by providing a persistent account used by the Research data web site server that has read-only access to all “Open” data sets.)
  Direct Users: Researchers

Solution: BPM Metastorm
  Description: frontend.

Solution: Research data search and retrieve web site
  Description: Web site. Expected to be CMS or possibly SharePoint
  Scope: Web site with relevant information and screens to search and return results
  Interfaces/Integrations:
    1. Data Retention Platform via REST to enable http(s) data transfer.
    2. FAST (embedded function) to allow search from a web page.
    3. Equella (API) to expose metadata onto search results.
    4. Active Directory/LDAP to authenticate file access
  Direct Users: Those searching for data sets

Solution: Equella
  Description: Metadata Database
  Scope: Stores metadata
  Interfaces/Integrations: See Metastorm, FAST and Research Web Site
  Direct Users: N/A

Solution: FAST
  Description: Search Engine
  Scope: Provides search results and rich search functionality on the metadata
  Interfaces/Integrations:
    1. Potential federation to Primo
    2. Crawl of Equella
  Direct Users: Anyone

Solution: Baggit
  Description: File collection tool
  Scope: Tool to assist researchers in selecting and bringing files into a collection
  Interfaces/Integrations: Linked to from Metastorm
  Direct Users: PI
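The search-and-access behaviour described for the research data web site (keyword search over metadata, with unauthenticated visitors limited to “Open” data sets) might look roughly like the sketch below. It does not use the FAST or Equella APIs; the `DatasetRecord` type, its fields, and the rule that any authenticated user sees every record are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    title: str
    keywords: List[str]
    open_access: bool  # hypothetical flag standing in for an "Open" data set


def search(records: List[DatasetRecord], term: str,
           authenticated: bool = False) -> List[DatasetRecord]:
    """Return records matching term; anonymous callers only see Open data sets."""
    term = term.lower()
    hits = []
    for record in records:
        visible = record.open_access or authenticated
        matches = term in record.title.lower() or any(
            term in keyword.lower() for keyword in record.keywords
        )
        if visible and matches:
            hits.append(record)
    return hits
```

In the architecture on the slide, authentication would come from Active Directory/LDAP rather than a boolean argument; the flag here only marks where that decision would plug in.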
Solution: DMP Online
  Description: Online tool providing support for creating a Data Management Plan that is managed to ensure Research Council requirements are met
  Scope: Used to create Data Management Plan
  Interfaces/Integrations:
    1. Metastorm will link this within curation workflow
    2. Metastorm will take the XML output of this and read key fields directly to automate some metadata creation in Equella
    3. Metastorm will save the output file of this tool
  Direct Users: PI

Solution: DOI
  Description: Online tool for creating a unique digital object identifier
  Scope: Workflow to fork out to this system to allow researcher to create a persistent object identifier.
  Interfaces/Integrations: See Metastorm
  Direct Users: PI

Solution: Active File Services
  Description: File services primarily for storage of active (i.e. not curated) files
  Interfaces/Integrations: The source of files for curation (“Bagging”). Selectable by browsing using Baggit tool.
  Direct Users: PI

Solution: “Other Repository”
  Interfaces/Integrations: Sometimes selectable by browsing using Baggit tool as the source of files for curation (“Bagging”). However, these may be databases or alternative repositories that are used instead.
  If used, and where possible, the DOI will point to these.
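The DMP Online integration, in which key fields are read from the tool's XML output to pre-fill metadata, could be approached along these lines. The slides do not show the DMP Online XML schema, so the element names, the sample document and the `extract_metadata` helper below are invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample export; the real DMP Online XML layout is not given in the slides.
SAMPLE_DMP_XML = """<plan>
  <project_title>Example project</project_title>
  <principal_investigator>Dr A. Researcher</principal_investigator>
  <funder>Example Research Council</funder>
</plan>"""

# Assumed element names for the "key fields" to carry over.
KEY_FIELDS = ("project_title", "principal_investigator", "funder")


def extract_metadata(xml_text):
    """Read the key fields from a plan export, skipping any that are absent or empty."""
    root = ET.fromstring(xml_text)
    metadata = {}
    for field in KEY_FIELDS:
        value = root.findtext(field)
        if value is not None and value.strip():
            metadata[field] = value.strip()
    return metadata


print(extract_metadata(SAMPLE_DMP_XML))
```

The resulting dictionary is the sort of structure a workflow tool could pass on when creating a metadata record, so the researcher does not have to re-enter details already given in the plan.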
ADMIRe Phasing: Drop 1 (to June 2013)
Objective: Deliver key functions but without over-integration
Deliverables:
1. Instructions and links on web site on how and why to use DMP Online
2. Instructions and links on web site on how and why to use DOI
3. Implementation (but not integration) of Baggit for Research users
4. Delivery of Metadata in Equella
   – Including instructions and links on web site on how and why to use
5. Creation of Research Data Search Page
   – Including instructions and links on web site on how and why to use
   – Implementation of FAST search crawl
   – Embed of FAST in web page
   – Delivery of Results page to include relevant information
6. Metastorm development that:
   – Creates User (PI Researcher) interface to Equella
   – Provides fields to add all metadata into Equella, including Research Project Information, Subject Specific Information, Technical Metadata
   – Allows Researcher to choose when a page is searchable
ADMIRe Phasing: Drop 2 (to Dec 2013)
Deliverables
1. Delivery of Retention platform
• Delivered outside of ADMIRe project
2. Delivery of Open Access Platform
• (Subset of Retention platform)
3. Definition and Delivery of
• End to end workflow automation and integration for data
management process with a vision of “Input Once”
• Integrations of Baggit, Agresso Awards Management, DMP Online,
DOI
4. Definition and Delivery of a report for Research Councils that
• Confirms project adherence (at Project close) to funding
requirements for data management and access
• Enables non-conformance to be addressed
Reusable outputs
• Focus groups/interview formats
• Requirements catalogue
• Use cases
• Survey – questions, write-up etc
• Software? No…
Questions?
[email protected]
ADMIRe Project Manager