Policies and procedures for data publication from the PREPARDE project
Sarah Callaghan, Fiona Murphy, Jonathan Tedds, Varsha Khodiyar, John Kunze, Rebecca Lawrence, Matthew S. Mayernik, Angus Whyte, Wil Wilcox, Timothy Roberts
#preparde [email protected] @sorcha_ni

Why cite and publish data?
• Pressure from the (UK) government to make data from publicly funded research freely available.
• Scientists want attribution and credit for their work.
• The public want to know what scientists are doing.
• Research funders want reassurance that they are getting value for money.
• Relies on peer review of science publications (well established) and of data (not done yet!).
• Allows the wider research community to find and use datasets, and to understand the quality of the data.
• Extra incentive for scientists to submit their data to data centres in appropriate formats and with full metadata.

Data, reproducibility and science
Science should be reproducible: other people doing the same experiments in the same way should get the same results. Observational data are not reproducible (unless you have a time machine!). Therefore we need access to the data to confirm the science is valid!

PREPARDE: Peer REview for Publication & Accreditation of Research Data in the Earth sciences
• Lead institution: University of Leicester
• Partners:
– British Atmospheric Data Centre (BADC)
– US National Center for Atmospheric Research (NCAR)
– California Digital Library (CDL)
– Digital Curation Centre (DCC)
– University of Reading
– Wiley-Blackwell
– Faculty of 1000 Ltd
• Project lead: Dr Jonathan Tedds (University of Leicester, [email protected])
• Project manager: Dr Sarah Callaghan (BADC, [email protected])
• Length of project: 12 months
• Project start date: 1st July 2012
• Project end date: 30th June 2013

Geoscience Data Journal, Wiley-Blackwell and the Royal Meteorological Society
• Partnership formed between the Royal Meteorological Society and academic publisher Wiley-Blackwell to develop a mechanism for the formal publication of data in the open-access Geoscience Data Journal (GDJ).
• GDJ publishes short data articles cross-linked to, and citing, datasets that have been deposited in approved data centres and awarded DOIs (or other permanent identifiers).
• A data article describes a dataset, giving details of its collection, processing, software, file formats, etc., without any requirement for novel analyses or ground-breaking conclusions: the when, how and why the data were collected, and what the data product is.

How we publish data: the traditional online journal model
1) Author prepares the paper using word processing software with the journal template.
2) Author submits the paper to the journal (any online journal system) as a PDF/Word file.
3) Reviewer reviews the PDF file against the journal's acceptance criteria.

The overlay journal model for publishing data
1) Author prepares the data paper using word processing software, and the dataset using appropriate tools.
2a) Author submits the data paper to the journal (e.g. Geoscience Data Journal).
2b) Author submits the dataset to a repository (e.g. BADC, BODC).
3) Reviewer reviews the data paper, and the dataset it points to, against the journal's acceptance criteria.
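The pointer from a data paper to its dataset is a DOI, which the global resolver at doi.org redirects to the repository's landing page. A minimal sketch of that resolution step in Python, using the third-party requests library (the example DOI is the GDJ data paper shown later in this deck):

```python
import requests

def resolve_doi(doi: str) -> str:
    """Resolve a DOI via the doi.org proxy and return the landing page URL.

    doi.org answers with an HTTP redirect to whatever URL the repository
    has registered for the identifier, so following redirects lands on the
    dataset's (or article's) landing page.
    """
    response = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
    response.raise_for_status()
    return response.url

# Example: the Geoscience Data Journal data paper cited later in this deck.
print(resolve_doi("10.1002/gdj3.2"))
```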
PREPARDE topics
Example steps/workflow required for a researcher to publish a data paper. Three main areas of interest:
1. Workflows and cross-linking between journal and repository
2. Repository accreditation
3. Scientific peer review of data
Plus the division of areas of responsibility between repository-controlled and journal-controlled processes.

Workflows
• Data centres:
– CEDA (broken down by type of data submitter)
– NCAR Earth Observing Laboratory (EOL): Computing, Data, and Software Facility
– NCAR CISL Research Data Archive (RDA), http://rda.ucar.edu/
– NERC DOI minting workflow
• Journals:
– Geoscience Data Journal
– International Journal of Digital Curation (control)

Data repository workflows
• Workflows are very varied! No one-size-fits-all method.
• A single data centre can have multiple workflows, depending on interactions with external sources ("engaged submitter" / "data dumper" / "third-party requester").

Repository workflow: NCAR Computational & Information Systems Lab Research Data Archive (RDA)
• Data ingest and preparation: automated file collection; check the integrity of file receipts; compare bytes and checksums (if available) with the original data providers. If something doesn't match, check with the data provider for changes to the files.
• Processing: validate files, using software to read the full content of every file; pull out metadata; identify errors and metadata holes; do time-series checks; check metadata against the internal standard/expectation; if necessary, filter data or fix metadata. If errors are found, contact the data provider.
• Access (after a development-phase embargo): notify the provider/user community; keep the most-demanded data online, with the rest in a tape-based archive plus remote backup.
• Publish metadata: the metadata database (spatial and temporal information, Global Change Master Directory (GCMD) keywords, parameters, format table relationships) feeds user GUIs and distributes metadata to GCMD, the NCAR Community Data Portal and the BADC, e.g. via OAI-PMH.
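The ingest step above compares bytes and checksums against what the data provider supplied. A minimal sketch of that verification in Python (the function names are illustrative, not taken from the RDA's actual ingest software):

```python
import hashlib
from pathlib import Path

def file_checksum(path: Path, algorithm: str = "md5") -> str:
    """Compute a file's checksum, reading in chunks so large datasets fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def receipt_ok(path: Path, expected_bytes: int, expected_checksum: str) -> bool:
    """Flag a received file for follow-up with the data provider if either the
    byte count or the checksum differs from what the provider reported."""
    return (path.stat().st_size == expected_bytes
            and file_checksum(path) == expected_checksum)
```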
Journal workflow
The aim is to minimise the effort needed to submit a data paper by taking advantage of already-submitted metadata. Sharing metadata also ensures that additions/corrections made in one location get propagated through to the others.
Generic data publication workflow: dashed lines indicate linking (via URL) or citation (via DOI); solid lines indicate the results of, or inputs into, processes; dotted lines indicate where the results of one process need to be fed back into another. Journal responsibilities are orange, the data centre's are purple.
Work on workflows is now being extended as part of the Research Data Alliance's Workflows Working Group, part of the Publishing Data Interest Group.

Cross-linking
This is what we have to focus on for PREPARDE: demonstrate cross-linking between GDJ and a data repository (BADC/NCAR). Unfortunately this direct cross-linking isn't scalable! There is a need for off-the-shelf solutions that can work across multiple research domains.

Cross-linking: the ideal situation
A registry could provide other functions as well as acting as an intermediary between journals and data repositories, e.g.:
• Certify that data centres are "trustworthy"
• Administer the linking mechanism
• Provide search and metrics functions
Disadvantages:
• Single point of failure
• Difficulty of standardisation across different research domains
Could OpenAIRE be this registry? Could DataCite? Could re3data.org? The registry would need to be discipline-agnostic!

Do we have a start?
DataCite have standardised a set of bibliographic metadata that have to be submitted before a DOI for a dataset can be minted by a repository.

Standardised metadata from repositories
• This metadata is then made openly available via the DataCite metadata search: http://search.datacite.org/ui
• Given a DOI, a journal can then easily find the standard DOI metadata in the DataCite Metadata Store.
• DataCite also have a content resolver: http://data.datacite.org/static/index.html
• What's missing is the return link, where the journal can let the repository know that a dataset has been cited (directly or via DataCite).
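As an illustration of what the journal-side lookup can look like, here is a minimal sketch against the present-day DataCite REST API (https://api.datacite.org); treating the endpoint as an assumption, since at the time of the project the lookup went through the metadata search and content resolver listed above:

```python
import requests

DATACITE_API = "https://api.datacite.org/dois/"

def datacite_metadata(doi: str) -> dict:
    """Return the DataCite metadata attributes registered for a dataset DOI."""
    response = requests.get(DATACITE_API + doi, timeout=30)
    response.raise_for_status()
    return response.json()["data"]["attributes"]

# Given any DataCite-registered dataset DOI, a journal can pull the fields it
# needs for a data article without asking the author to re-type them, e.g.:
#   attrs = datacite_metadata(dataset_doi)
#   attrs["titles"], attrs["creators"], attrs["publisher"], attrs["publicationYear"]
```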
What PREPARDE has done
• We already have a link from the GDJ data article to the data repository (BADC or NCAR), thanks to the DOI.
• GDJ can also pull the standard DOI metadata attached to that DOI from the DataCite Metadata Store.
• GDJ needs to inform the repository that its dataset has been cited/published, bearing in mind the scaling issues!
• At this time, we have a manual workaround (i.e. email).

Live data paper!
The dataset citation is the first thing in the paper (after the abstract), and is also included in the reference list (to take advantage of citation-count systems). DOI: 10.1002/gdj3.2. The dataset catalogue page (which is also the DOI landing page) carries a reference to the data article and a clickable link back to it.

Other types of cross-linking
1. Data repository banner ads
2. Geographical maps
3. Pulling metadata from the data repository into journal workflows
4. "Data behind the graph"
For each type of cross-linking we investigated: the type of cross-linking; the reason for it; current procedures; how to implement the cross-link in Geoscience Data Journal (GDJ); how to roll it out to other journals; and further work and issues.

Data repository banner ads (1)
Example banner link in a ScienceDirect article (http://www.sciencedirect.com/science/article/pii/S0921818111001159).

Data repository banner ads (2)
• Allows readers of the article to get to the dataset in the repository, or to the top level of the repository (where they can browse/search for the data).
• The article is text-mined for strings such as flags, accession numbers or names of data repositories.
• A taxonomy and controlled vocabulary will help automate this.
• Webpage real estate tends to be limited! A flyover image and link might be more appropriate than a fixed ad.
• Requires a relationship between journal publishers and repositories to ensure that the ad/logo is up to date and the link is correct.

Geographical maps (1)
Example mapping of geolocation metadata on the Pangaea data repository landing page (http://doi.pangaea.de/10.1594/PANGAEA.735719).

Geographical maps (2)
Example Elsevier article on ScienceDirect displaying geolocation metadata on a map for the dataset referred to in the article.

Geographical maps (3)
• Takes advantage of geolocation data present in the dataset's metadata.
• Allows the plotting of multiple dataset locations on the same map.
• Option to ingest geolocation metadata from the repository or from the DataCite metadata.
• Best not to duplicate metadata unnecessarily in different locations, i.e. keep metadata about the dataset with the dataset in the repository.
• Standardisation is key: ingesting metadata from multiple repositories using a different method for each is not scalable.
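As a sketch of the DataCite option above: pulling geolocation points out of a dataset's DataCite metadata ready for plotting. The geoLocations field names follow the current DataCite schema and should be treated as an assumption, since individual records (and schema versions) vary:

```python
def geolocation_points(attrs: dict) -> list[tuple[float, float]]:
    """Extract (latitude, longitude) points from DataCite metadata attributes,
    e.g. as returned by the datacite_metadata() sketch earlier."""
    points = []
    for loc in attrs.get("geoLocations", []):
        point = loc.get("geoLocationPoint")
        if point:
            points.append((float(point["pointLatitude"]),
                           float(point["pointLongitude"])))
    return points

# The resulting list can be handed to any mapping library to plot multiple
# dataset locations on the same map.
```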
Pulling metadata from the data repository into journal workflows (1)
Example figshare widget embedded in an F1000Research paper (http://f1000research.com/articles/1-3/v1). The widget provides access to the data in figshare, enables the metadata to be previewed within the article, and provides repository metadata about the dataset (namely the number of views, shares, downloads, etc.).

Pulling metadata from the data repository into journal workflows (2)
• Pre-publication metadata shared between repositories and journals at the article submission stage:
– reduces duplication of effort by the author in entering dataset metadata twice;
– ensures consistency of information between journal and repository.
Possibilities:
• The author inputs minimal dataset information, such as the DOI, and the journal uses the DOI to locate the metadata and add the necessary information to the journal article.
• For repositories requiring significant amounts of metadata, it may be possible to create a tool that automatically generates the first draft of a very structured data article.
• Implementation of embedded widgets requires many-to-many relationships to be built up to map the dataset metadata appropriately.
• How much dataset metadata should the reviewer see on the journal site?

"Data behind the graph" (1)
Active chart created and displayed (http://activecharts.org/share/a7dd3bae149b2aba5b8f0d895e00d364), featuring user-selectable tick boxes to display/hide data and replot it.

"Data behind the graph" (2)
Left: the F1000Research data plotting tool showing the raw data and the author's graph of that data (Nicholls et al 2013, doi: 10.12688/f1000research.2-150.v1). Right: an example replot of the data using the F1000Research data plotting tool, which enables readers to replot raw spreadsheet data, changing the x- and y-axes as appropriate.

"Data behind the graph" (3)
• Access to the data behind the graph enables researchers to make direct comparisons with previously published, or as yet unpublished, results.
• Clicking on the plot would redirect the reader to the subset of the data used to create that plot.
• There are issues with citation granularity: is it appropriate to assign a DOI to just the subset of a larger dataset that underlies a particular graph?
• Imagine a mixed ecosystem in the future, where repository-managed data, cross-linked with research articles, exists alongside small, specific, image-related datasets that are hosted alongside, and more closely bound to, the articles themselves.
• Relies on authors being willing and able to submit the exact data subsets they used to create each figure; this will involve additional work, both in producing the subsets and in archiving them.

Recommendations and conclusions (1): standardisation of metadata
• Automatic processes for the linking and sharing of metadata need to be developed, and these require common standards.
• PREPARDE recommends the DataCite metadata schema as a common metadata kernel for sharing and exchanging dataset metadata.
• It is also recommended that an agreed geolocation standard be implemented, given the wide range of multidisciplinary datasets that can be combined in this way. (cf. https://xkcd.com/927/)

Recommendations and conclusions (2): use of DOIs and data citation
• Use DOIs for linking data to publications.
• For formal data citation, PREPARDE recommends the citation structure given in the DataCite metadata schema v3.0 (http://schema.datacite.org), though where appropriate to the scientific domain other permanent identifiers may be used.
• Citations of data should be included in the reference list of the article.
• Journals' author guidelines should be updated to request that authors cite the datasets used in their articles.

Recommendations and conclusions (3): role of a centralised, third-party registry
• A registry would simplify the process of passing information between data repositories and journals.
• As yet this registry does not exist, though some existing initiatives (DataCite, OpenAIRE) provide some aspects of the service that would be required of it. Although not data-related, CrossRef also provides some aspects of this registry service.
• PREPARDE recommends that this be investigated through the Publishing Data Interest Group of the Research Data Alliance.
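A minimal sketch of the citation structure recommended in part (2) above, Creator (PublicationYear): Title. Publisher. Identifier, built from the kernel metadata fields (all example values are placeholders, not a real dataset):

```python
def format_data_citation(creators: list[str], year: int, title: str,
                         publisher: str, doi: str) -> str:
    """Build a data citation following the structure recommended in the
    DataCite metadata schema: Creator (PublicationYear): Title. Publisher. Identifier."""
    return f"{'; '.join(creators)} ({year}): {title}. {publisher}. doi:{doi}"

# Placeholder values, purely illustrative:
print(format_data_citation(["Smith, A.", "Jones, B."], 2013,
                           "Example observational dataset",
                           "British Atmospheric Data Centre", "10.xxxx/xxxxx"))
```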
Problems still to solve
• Automatic methods for:
– the (data) journal informing the repository that a dataset has been cited (a hypothetical sketch follows below);
– the repository linking back to the paper citing the dataset.
• Sharing of dataset metadata between repository and journal:
– so the paper author doesn't have to repeatedly enter metadata in multiple locations;
– so corrections made in one place can be propagated across.
• A centralised registry for cross-linking, to deal with the scalability issues in direct linking between journals and repositories.
• Methods for issuing corrections to data after the data paper has been published.
The general problem: http://xkcd.com/974/
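To make the missing return link concrete, here is a purely hypothetical sketch of the information a journal-to-repository (or registry-mediated) citation notification would need to carry. No such API existed at the time of the project; the workaround was a manual email:

```python
# Hypothetical payload only: every field name here is an assumption about what
# a future notification service might carry, not an existing API.
citation_notification = {
    "dataset_doi": "10.xxxx/xxxxx",       # placeholder for the cited dataset DOI
    "article_doi": "10.1002/gdj3.2",      # the citing data paper
    "journal": "Geoscience Data Journal",
    "relation_type": "IsCitedBy",         # from the DataCite relatedIdentifier vocabulary
}
# A discipline-agnostic registry could accept messages like this from journals
# and relay them to the relevant repository, avoiding many-to-many integrations.
```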
Repository accreditation
The link between the data paper and the dataset is crucial!
• How do data journal editors know a repository is trustworthy?
• How can repositories prove they're trustworthy?

What makes a repository trustworthy?
• Many things: mission, processes, expertise, workflows, history, systems, documentation, ...
• Assessing trustworthiness requires assessing the entire repository workflow.
• PREPARDE / IDCC13 workshop report: http://proj.badc.rl.ac.uk/preparde/attachment/wiki/DeliverablesList/PREPARDE_IDCC_WshopReport.pdf
• Peer review of data is implicitly peer review of the repository.
• And what does "trustworthy" mean, when you get right down to it?

Repository accreditation schemes
These schemes look at all of the business of running a repository, but don't directly address the issues required for data publication. Data for publication needs to:
• be persistent;
• be permanently identified;
• be provided with a landing page;
• have standard publication metadata;
• have accessibility/licensing information.
Document at: http://bit.ly/ZhYHZl
Feedback to: https://www.jiscmail.ac.uk/DATAPUBLICATION

Repository accreditation
For data publication, a repository must be actively managed in order to:
1. Enable access to the dataset
2. Ensure dataset persistence
3. Ensure dataset stability
4. Enable searching and retrieval of datasets
5. Collect information about repository statistics
The guidelines are split into general principles and subject-specific appendices. Only the Earth and life sciences are in the appendices at this time.

1. Enable access to the dataset
a. Ensure that data will be accessible (either as open data, or by providing information on the conditions of access and a clear point of contact).
b. Have a policy in place allowing appropriate access for peer reviewers, as required as part of support for the data peer-review process.
i. In the context of data, peer reviewers are individuals with appropriate scientific and/or technical expertise who produce or use data.

2. Ensure dataset persistence (1)
a. Have a clear and public assertion of responsibility to preserve the data and provide access to the data over the long term.
b. Have an appropriate, formal succession plan, contingency plans, and/or escrow arrangements in place in case the repository ceases to operate or the governing or funding institution substantially changes its scope.
c. Repositories must develop and implement suitable quality control and security measures to ensure the metadata are correct and the data themselves are maintained and curated to avoid degradation.
i. User feedback can and should be used to strengthen and correct the metadata as needed.

2. Ensure dataset persistence (2)
d. Assign globally unique persistent IDs to the published datasets and maintain a repository-managed URI associated with each of those IDs. These URIs should also be associated with versions of the datasets.
e. Permanent IDs for the dataset must resolve to a publicly accessible landing page, which must:
i. be open and human-readable (and preferably also be provided in a machine-readable format);
ii. describe the data object and include appropriate metadata and the permanent identifier (used to identify the page in the first place);
iii. be maintained, even if the data has been retracted.
Preserving data: how not to do it!

3. Ensure dataset stability
a. Stability means that the exact same version of the dataset that was cited can be returned to when the citation is resolved.
b. If dataset versioning is supported, new versions should be permanently identified and linked from the original, published dataset landing page, without overwriting the original version linked from the article. The database should provide time-stamped versions of archival data.

4. Enable searching and retrieval of datasets
a. Allow users to easily determine whether a dataset has been peer reviewed or been subject to a robust quality assurance process.
b. Provide appropriate metadata about the dataset in human-readable form on the landing page (see point 2.e), and where possible in standardised machine-readable formats, e.g. the DataCite metadata schema, http://schema.datacite.org (one retrieval route is sketched after these guidelines).
c. Provide appropriate information about licensing and permissions, and manage access to restricted or embargoed material as appropriate.
d. Provide access to allow metadata for the datasets to be searched and retrieved through interfaces designed for both humans and computers.

5. Collect information about repository statistics
a. Publish statistics on the level of access to any deposited item that is publicly accessible, to contribute to metrics of the item's publication impact.
b. Publish information to enable journals and depositors to assess the repository's take-up in the community it aims to serve, e.g. about any operational agreement with a well-established journal, learned society or equivalent body.
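Point 4.b above asks for machine-readable metadata alongside the human-readable landing page. One way a journal or harvester can retrieve it is DOI content negotiation, sketched below (the DataCite JSON media type is assumed to be supported for DataCite-registered DOIs):

```python
import requests

def machine_readable_metadata(doi: str) -> dict:
    """Ask the DOI resolver for DataCite JSON rather than the human-readable
    landing page, via HTTP content negotiation."""
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.datacite.datacite+json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```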
What we learned about repository accreditation
• It is a very contentious subject!
– Repository accreditation schemes exist, but don't have significant numbers of members.
– The reason for the lack of uptake of these schemes is not clear. Do repositories feel that there is no clear benefit? Is the accreditation process unclear, too arduous and/or confusing?
• Repositories seem content to rely on their own reputations to demonstrate their suitability as archives for data publication.
– We think this will change in the near future, as data publication and data stability become more important.
– Further work is needed to identify the blockers to the uptake of repository accreditation schemes.

Data peer review for publishing data
Dr Jonathan Tedds, [email protected], @jtedds
Senior Research Fellow, Director: Health And Research Data Informatics, Department of Health Sciences, University of Leicester
PI #PREPARDE, http://www.le.ac.uk/projects/preparde
Editor-in-Chief, Open Health Data journal (Ubiquity)
Co-chair, Research Data Alliance – WDS Publishing Data IG

Why open, why peer review?
• Science as an Open Enterprise report (Royal Society, June 2012, http://royalsociety.org/policy/projects/science-public-enterprise/report/):
– "As a first step towards this intelligent openness, data that underpin a journal article should be made concurrently available in an accessible database."
– "We are now on the brink of an achievable aim: for all science literature to be online, for all of the data to be online and for the two to be interoperable." [p.7]
• Issues linking data to the scientific record:
– data persistence;
– data and metadata quality;
– attribution and credit for data producers.
• Geoffrey Boulton (Edinburgh), lead author:
– "Science has been sleepwalking into a crisis of replicability... and of the credibility of science"
– "Publishing articles without making the data available is scientific malpractice"

Peer review of data: the perfect disaster?
• Support for the peer review process:
– scholars contribute peer reviews with little formal reward;
– an opportunity to polish and refine understanding of the cutting edge of research.
• But the peer review system is under stress:
– an exploding number of journals, conferences and grant applications;
– self-publication tools (blogs and wikis) allow scholars to disseminate their research results and products faster and more directly.
• We are now adding research data into the publication and peer review queues... see Mayernik et al., accepted, BAMS!

Peer review of data
• Technical:
– author guidelines for GDJ;
– funder Data Value Checklist;
– implicit peer review of the repository?
• Scientific:
– pre-publication? post-publication (e.g. F1000Research)?
– guidelines on uncertainty, e.g. IPCC;
– discipline-specific, e.g. EU INSPIRE spatial formatting?
• Societal:
– contribution to human knowledge;
– reliability.

Open peer review of data?
ESSD peer review ensures that the datasets are:
• plausible, with no immediately detectable problems;
• of sufficiently high quality, with their limitations clearly stated;
• well annotated with standard metadata, and available from a certified data centre/repository;
• customary with regard to their format(s) and/or access protocol, and expected to be usable for the foreseeable future;
• openly accessible (toll free).
Earth System Science Data journal: http://www.earth-system-science-data.net/
Rebecca Lawrence, Data Publishing: peer review, shared standards and collaboration, http://www.dcc.ac.uk/events/research-data-management-forum-rdmf/rdmf8-engaging-publishers
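Checklists like ESSD's (and the F1000Research one that follows) lend themselves to a machine-actionable form that a journal submission system could track per review. A purely hypothetical sketch; neither journal publishes its criteria in this form:

```python
# Hypothetical structure only: the criteria paraphrase the ESSD list above;
# the record format is an illustration, not either journal's actual system.
DATA_REVIEW_CRITERIA = [
    "Plausible, with no immediately detectable problems",
    "Sufficiently high quality, with limitations clearly stated",
    "Well annotated with standard metadata, in a certified repository",
    "Customary format(s)/access protocol, usable for the foreseeable future",
    "Openly accessible (toll free)",
]

def review_report(verdicts: dict) -> str:
    """Render a reviewer's pass/fail verdicts as a simple checklist report."""
    return "\n".join(
        f"[{'x' if verdicts.get(criterion) else ' '}] {criterion}"
        for criterion in DATA_REVIEW_CRITERIA
    )
```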
Faculty of 1000 open peer review
Sanity check:
• adherence to the format and a suitable basic structure;
• a standard basic protocol structure is adhered to;
• data stored in the most appropriate and stable location.
Open peer review:
• Is the method used appropriate for the scientific question being asked?
• Has enough information been provided to be able to replicate the experiment?
• Have appropriate controls been conducted, and the data presented?
• Are the data in a usable format/structure?
• Are stated data limitations and possible sources of error appropriately described?
• Do the data "look" OK (optional; e.g. microarray data)?

Draft recommendations on peer review of data
• Summary recommendations from a workshop at the British Library, 11 March 2013.
• Workshop attendees included funders, publishers, repository managers, researchers...
• Draft recommendations were put up for discussion and the feedback captured.
• Feedback from the community is still welcome.
• A second workshop on 24 June will put the recommendations to peer reviewers!
Document at: http://bit.ly/DataPRforComment
Feedback to: https://www.jiscmail.ac.uk/DATAPUBLICATION

Draft recommendations on data peer review
Summary recommendations from the workshop at the British Library, 11 March 2013:
• Connecting data review with data management planning
• Connecting scientific and technical review with curation
• Connecting data review with article review
• 4-5 draft recommendations in each of the above
• Assist researchers, publishers, journal editors, reviewers, data centres and institutional repositories to map requirements for data peer review
• A matrix of stakeholders vs processes:
– assists in assigning responsibilities for a given context;
– new for most disciplines;
– learn from the disciplines where this already happens.

Connecting data review with data management planning
1. All research funders should at least require a "data sharing plan" as part of all funding proposals, and if a submitted data sharing plan is inadequate, appropriate amendments should be proposed.
2. Research organisations should manage research data according to recognised standards, providing relevant assurance to funders so that additional technical requirements do not need to be assessed as part of the funding application peer review. (Additional note: research organisations need to provide adequate technical capacity to support the management of the data that their researchers generate.)
3. Research organisations and funders should ensure that adequate funding is available within an award to encourage good data management practice.
4. Data sharing plans should indicate how the data can and will be shared, and publishers should refuse to publish papers which do not clearly indicate how the underlying data can be accessed, where appropriate.

Connecting scientific, technical review and curation
1. Articles and their underlying data or metadata (by the same or other authors) should be multi-directionally linked, with appropriate management for data versioning.
2. Journal editors should check data repository ingest policies to avoid duplication of effort, but provide further technical review of important aspects of the data where needed. (Additional note: a map of the ingest/curation policies of the different repositories should be generated.)
3. If there is a practical/technical issue with data access (e.g. files don't open or don't exist), then the journal should inform the repository of the issue. If there is a scientific issue with the data, then the journal should inform the author in the first instance; if the author does not respond adequately to serious issues, then the journal should inform the institution, who should take the appropriate action. Repositories should have a clear policy in place to deal with any feedback.

Connecting data review with article review
1. For all articles where the underlying data are being submitted, authors need to provide adequate methods and software/infrastructure information as part of their article. Publishers of these articles should have a clear data peer review process for authors and referees.
2. Publishers should provide simple and, where appropriate, discipline-specific data review (technical and scientific) checklists as basic guidance for reviewers.
3. Authors should clearly state the location of the underlying data. Publishers should provide a list of known trusted repositories or, if necessary, provide advice to authors and reviewers on alternative suitable repositories for the storage of their data.
4. For data peer review, the authors (and journal) should ensure that the data underpinning the publication, and any tools required to view it, are fully accessible to the referee. The referees and the journal then need to ensure that appropriate access is in place following publication.
5. Repositories need to provide clear terms and conditions for access, and ensure that datasets have permanent and unique identifiers.

TODO
• What's missing?
– Need context, including the long tail and the international picture.
– The recommendations currently assume a lot about the publishing paradigm and its processes/workflows.
– Suggest criteria in at least one discipline as an example? (e.g. the International Journal of Epidemiology and statistical review)
– Open community review?
• Who are they for?
– The long tail; journal submission systems; model more generically.
• What next?
– How much would it cost in resources to implement these recommendations?
– A future RDA Working Group? Practical training in data review?
– RDA Workflows WG: can we map the recommendations to the workflows?
– Is your organisation ready to buy into this?

Please! Tell us what you think
Always happy to get input from others!
#preparde [email protected] [email protected] @sorcha_ni @jtedds http://citingbytes.blogspot.co.uk/
Guidelines on peer review for data: http://bit.ly/DataPRforComment
Guidelines for repository accreditation for data publication: http://bit.ly/ZhYHZl
Feedback to: [email protected]
Project website: http://www.le.ac.uk/projects/preparde
Project blog: http://proj.badc.rl.ac.uk/preparde/blog