Transcript Slide 1

When the 'Thing' Is a Digital
Scholarly Publication:
Connecting Publications to
Linked Data
Nancy Kopans
General Counsel, VP and Secretary, ITHAKA
How do we Organize Data?
Left Image: MIT Library, http://goo.gl/YOG25P
Right Image: Drees, DeDree. "Diderot Bugs." Flickr. Yahoo!, n.d. Web. 25
Aug. 2014. http://goo.gl/D1eAG2
Gregor Mendel
“Father of Genetics”
Studied approximately 29,000 pea plants.
Led Mendel to make the generalization
now known as Mendel’s Law of Inheritance
(dominant and recessive genes).
Image: Wellcome Library, London http://goo.gl/P56N5J
Gregor Mendel
Published his findings in 1865.
Findings were generally ignored.
Rediscovered in 1900.
In 1936, Sir Ronald Fisher published an
article calling Mendel’s data into question,
saying it seemed too good to be true.
From 1964-2007 at least 50 papers were
published trying to untangle the
controversy of Mendel’s data.
Scientists continue to study and discuss
Mendel’s data. The controversy can never
be satisfactorily settled because Mendel’s
notebooks are missing and said to have
been burned.
Image: The Mendel Museum of genetics Brno, Abbey of St Thomas,
Brno, Czech Republic http://goo.gl/ZH3HXo
Pires, Ana M., and João A. Branco. "A Statistical Model to Explain the Mendel—Fisher Controversy." Statistical
Science (2010): 545-565.
How was data connected to scholarship?
Data was observed and summarized
Left Image: JSTOR Plants, original material of Pisum sativum var. sativum L. [family FABACEAE: FABOIDEAE] http://goo.gl/sSQVvB
Right Image: Isaac Newton’s Notebook Cambridge University Library
http://goo.gl/zHt5ms
The Rise of Digital Publishing
Search for “Mendel” and “Pea” yields
156 results from 2000-2014 in less
than one second.
Photo by Etiennekd (Own work) [CC-BY-SA-3.0
(http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
http://goo.gl/sMPvUw
The Rise of Digital Publishing
• According to the 2012 Association of American
Publishers Journals Publishing Survey, 94% of the
journals in the survey were available in electronic
format.
• Well over half of new acquisitions at all academic
libraries in the 2012 fiscal year were e-books.
Tagler, John. "2011 AAP Industry Analysis of Journals Publishing." Professional/ Scholarly Publishing Bulletin Volume
12, No. 2, Spring/Summer 2013
"Percentage of E-Books at Academic Libraries, by Institution Type, FY 2012." The Chronicle of Higher Education. The
Chronicle of Higher Education, 18 Aug. 2014. Web. 02 Sept. 2014.
Enabling New Forms of Scholarly Publishing
• Scholars continue to expand the role of data in their
research.
• Research generates greater quantities of data than ever
before.
• The nature of digital publishing creates the possibility for
more dynamic use of data.
Examples of Dynamic Data Use in Popular Media
• Public Debt Causes Economies to Grow AND Shrink?
• The basis for Representative Paul Ryan's 2012 Federal Budget Plan was that
high public debt stifles economic growth, a claim based on the Reinhart and
Rogoff paper.
• Reinhart and Rogoff shared their spreadsheets with three researchers.
Based on the discovery of a coding error and the exclusion of certain
years and countries from the calculations, the researchers found the
opposite. The data leads to the conclusion that, as a general matter,
economies grow in countries with 90% public debt load.
"Holy Coding Error, Batman." Paul Krugman Holy Coding Error Batman Comments. N.p., n.d. Web. 22 Aug. 2014.
• Dinosaurs don’t grow that fast!
• Nathan P. Myhrvold, the former CTO of Microsoft, was able to highlight
problems in previously calculated dinosaur growth rates by going back
over paleontologists’ data. He published his findings in PLoS ONE.
Myhrvold NP (2013) Revisiting the Estimation of Dinosaur Growth Rates. PLoS ONE 8(12): e81917. doi:10.1371/journal.pone.0081917
Chang, Kenneth. "A Hobbyist Challenges Papers on Growth of Dinosaurs." The New York Times. The New York Times, 16 Dec. 2013.
Web. 22 Aug. 2014.
The Changing Scholarly Record
• “The scholarly record, by virtue of its transition to digital formats, is now
much more mutable and dynamic than in the past; it is made available
through a blend of both formal and informal publication channels…[its]
boundaries are expanding to include a much wider context.”
• Instead of a “top-down” view of the scholarly record, we could take a “bottom-up”
approach that enumerates the specific types of materials the scholarly
record might include.
Icon by Design Contest
http://goo.gl/bNufyN
Lavoie, Brian, Eric Childress, Ricky Erway, Ixchel Faniel, Constance Malpas, Jennifer Schaffner, and Titia van der Werf. 2014. The
Evolving Scholarly Record. Dublin, Ohio: OCLC Research. http://www.oclc.org/research/publications/library/2014/oclcresearch-evolving-scholarly-record-2014.pdf
The Changing Scholarly Record
Lavoie, Brian, Eric Childress, Ricky Erway, Ixchel Faniel, Constance Malpas, Jennifer Schaffner, and Titia van der Werf. 2014. The Evolving Scholarly
Record. Dublin, Ohio: OCLC Research. http://www.oclc.org/research/publications/library/2014/oclcresearch-evolving-scholarly-record-2014.pdf
Scholarly record: more dynamic, less “bounded”
• Formerly, digital artifacts and scholarship were more or
less discrete objects.
• Increasingly, the scholarly “article” is evolving into a
multi-part, distributed object.
• The new article is broken into “building blocks” including
text, graphics and data which reside in different
repositories, maintained by different institutions,
employing different technologies.
• Evolving relationships between building blocks must be
preserved over time.
Dvortygirl. Notebook Collection. Digital image. Flickr. N.p., 26 Apr. 2008. Web. 26 Aug. 2014.
Changing Role of Data in Publications
• Increasing expectations for connections between publications and
underlying data.
• Importance of collaboration among publishers, data centers, and
preservation services to build tools to serve this need.
• Goal is to preserve not just publications and data but the
relationships between and among them.
Icon by iconshock http://goo.gl/Y3Jl7w
How do we connect data to scholarship now?
Image: LOD Cloud Diagram as of September 2011. Anja Jentzsch. CC BY-SA 3.0
Problems connecting data and scholarship persist
• “We have grown accustomed to reading papers in which tables,
figures, and statistics summarize the underlying data, but the data
themselves are unavailable.”
• “One study of articles found the odds of a dataset still existing fell by
17% each year. The odds of having a working email address for the
researcher fell 7% each year.”
Vision, Todd J. "Open Data and the Social Contract of Scientific Publishing."
BioScience Vol. 60, No. 5 (May 2010), pp. 330-331. Published by: Oxford University Press on behalf of the American
Institute of Biological Sciences. Stable URL: http://www.jstor.org/stable/10.1525/bio.2010.60.5.2
“Research Data Management in Policy and Practice: The DataRes Project.” Research Data Management: Principles, Practices, and
Prospects. Council on Library and Information Resources, 2013. 6-38. Web. 22 Aug. 2014.
<http://www.clir.org/pubs/reports/pub160/pub160.pdf>.
Vines, Timothy H., et al. "The availability of research data declines rapidly with article age." Current Biology
24.1 (2014): 94-97.
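To make the quoted 17% figure concrete, the following is a minimal sketch, assuming the decline compounds at a constant rate from year to year (a simplification; the function names and the chosen time spans are illustrative and do not come from the study). Note that the figure applies to odds, not to probabilities, so a helper converts between the two.

```python
def remaining_odds_factor(years: int, yearly_decline: float = 0.17) -> float:
    """Fraction of the starting odds left after `years`.

    Assumes the same proportional decline every year (illustrative only).
    """
    return (1.0 - yearly_decline) ** years


def odds_to_probability(odds: float) -> float:
    """Convert odds (p / (1 - p)) back to a probability p."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(years, round(remaining_odds_factor(years), 3))
```

Under this assumption, the odds of a dataset still existing after ten years are about 0.155 times the starting odds.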
Challenges
Problems persist due to lack of:
• funding for research data management programs
• organizational structures
• professional preparation
• priority among researchers
• institutional mandates
Icon by aha-soft http://goo.gl/ShaJro
“Research Data Management in Policy and Practice: The DataRes Project.” Research Data Management:
Principles, Practices, and Prospects. Council on Library and Information Resources, 2013. 6-38. Web. 22 Aug.
2014. Prospects for Research Data Management, by Martin Halbert 1-16.
<http://www.clir.org/pubs/reports/pub160/pub160.pdf>.
Efforts to Improve Data Management
Government Interest in Ensuring Access to Research Data
• In 2013, the White House Office of Science and Technology Policy
(OSTP) called for federally funded agencies to develop plans for
public access to publications and data resulting from federal funding.
• The NIH has outlined a major program known as Big Data to
Knowledge (BD2K) and additional agencies have joined NIH and
NSF in requiring data management plans as part of the proposal
submission process.
Stebbins, Michael. "Expanding Public Access to the Results of Federally Funded Research." Web log post. The White House Blog. The White
House, 22 Feb. 2013. Web. 25 Aug. 2014. <http://www.whitehouse.gov/>.
Efforts to Improve (continued)
• Data and publications need to be supported in a cohesive manner.
• The OAI-ORE (Open Archives Initiative - Object Reuse and Exchange)
features the concept of resource maps (ReMs) or information graphs that
describe aggregations of publications and data and, perhaps more
importantly, the relationships between them.
• Private sector companies such as Google, Amazon and Facebook use their
own proprietary information graphs to describe and access content and
services. OAI-ORE resource maps are an open complement to proprietary
approaches.
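As an illustration of what a resource map might contain, here is a minimal sketch in plain Python. It uses the OAI-ORE terms `ResourceMap`, `Aggregation`, `describes`, and `aggregates`, plus the Dublin Core term `references`, to link an article to its dataset. The `example.org` identifiers and the function names are hypothetical, and this is not RMap's actual data model.

```python
ORE = "http://www.openarchives.org/ore/terms/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical identifiers, used only for illustration.
REM = "http://example.org/rem/1"
AGGREGATION = "http://example.org/aggregation/1"
ARTICLE = "http://example.org/article/1"
DATASET = "http://example.org/dataset/1"


def build_resource_map():
    """Return (subject, predicate, object) triples for one small ReM."""
    return [
        (REM, RDF_TYPE, ORE + "ResourceMap"),
        (REM, ORE + "describes", AGGREGATION),
        (AGGREGATION, RDF_TYPE, ORE + "Aggregation"),
        (AGGREGATION, ORE + "aggregates", ARTICLE),
        (AGGREGATION, ORE + "aggregates", DATASET),
        # The relationship between the two aggregated resources.
        (ARTICLE, DCTERMS + "references", DATASET),
    ]


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(build_resource_map()))
```

The last triple is the part that point-to-point links usually miss: it records how the article and dataset relate, not just that both exist.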
Limitations
• While web-based search and discovery is an important use case, it
is not the only use case in the scholarly environment.
• In addition to modes of access not readily supported through web
browsers (e.g., visualization and simulation), preservation needs
mandate models and information graphs that account for
provenance.
• In order to support a range of diverse content and services in an
open, sustained manner, the scholarly community needs to develop
its own set of information graphs that complement government and
private sector approaches.
Potential Solution: RMap Project
• A two-year project supported by a grant from the Alfred P. Sloan
Foundation undertaken by the Data Conservancy, Portico, and
IEEE.
• Preservation of publications, their underlying data, and the complex
relationships of text, graphics, and other elements that often reside
in different repositories, maintained by different institutions
employing different technologies.
• RMap will build on the features of the semantic web and linked data,
adopting concepts from the OAI-ORE which specifies graphs that
capture the relationships among publications, data, and other
artifacts of scholarly research and communication, and facilitates the
expression of the evolution of those relationships.
For more information: http://rmap-project.info/
Antecedents to RMap Project
• arXiv.org (pronounced “archive”) is an online repository for electronic
preprints of scientific papers.
• From 2010 to March of 2013, arXiv collaborated with the Data
Conservancy to support remote data deposit for arXiv submissions.
• Authors submitted a paper for publication in arXiv along with the
data. The data was deposited with the Data Conservancy and a
bidirectional link established between the data and research.
• The pilot identified challenges such as:
• Lack of metadata
• Preservation difficulties due to wide array of file formats
Steinhart, Gail, Simeon Warner, and Oya Rieger. "ArXiv-Data Conservancy Pilot." Web log post. Digital Scholarship and Preservation
Services. DSPS Press Digital Scholarship and Preservation Services, 14 June 2014. Web. 26 Aug. 2014.
Other Projects Connecting Data and Publications
• OpenAIREPlus (2009) - an EU-funded project aimed at linking research
publications aggregated in the OpenAIRE portal to the accompanying
research and author information.
• DataCite (2009) - an international not-for-profit that aims to improve data
citation. Offers services such as the DataCite Metadata Store that allows
publishers to create DOIs and register the associated metadata.
• LODLAM (2010) - an informal network of those interested in Linked Open Data
in Libraries, Archives and Museums.
• Figshare (2011) - allows researchers to publish data in a citable, sharable manner.
All research made publicly available on figshare gets allocated a
DataCite DOI (digital object identifier) at point of publication.
• ODIN (2012) - a tool that allows authors to link their works and research
outputs from the DataCite Metadata Store with their existing profiles,
including ORCID profiles.
• ORCID - a non-profit effort to maintain a registry of unique researcher
identifiers and a transparent method of linking research activities and
outputs to these identifiers.
How RMap Advances the State-of-the-Art
• Mapping Connections: The resulting framework and prototype will
represent the connections among cited and uncited data and publications
with a graph-based view that captures many-to-many relationships rather
than the point-to-point viewpoint of current systems.
• Preservation: The framework will include preservation of the connection
between the data and publications.
• Multidisciplinary: This infrastructure will be designed and prototyped with
a multidisciplinary approach from the outset, thus reducing the
dependencies or idiosyncrasies that often arise from discipline-specific
approaches.
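The difference between a graph-based, many-to-many view and a point-to-point link can be sketched as follows. This is an illustrative toy, not the RMap framework: the class name, the relation labels (`usesData`, `cites`), and the node identifiers are all assumptions.

```python
from collections import defaultdict


class RelationshipGraph:
    """Stores typed links so any node can relate to any number of others."""

    def __init__(self):
        self._outgoing = defaultdict(set)  # source -> {(relation, target)}
        self._incoming = defaultdict(set)  # target -> {(relation, source)}

    def relate(self, source, relation, target):
        """Record one directed, labeled link between two nodes."""
        self._outgoing[source].add((relation, target))
        self._incoming[target].add((relation, source))

    def neighbors(self, node):
        """Return every node linked to `node`, in either direction."""
        linked = {target for _, target in self._outgoing.get(node, ())}
        linked |= {source for _, source in self._incoming.get(node, ())}
        return linked


graph = RelationshipGraph()
graph.relate("paper-A", "usesData", "dataset-1")
graph.relate("paper-A", "usesData", "dataset-2")
graph.relate("paper-B", "usesData", "dataset-1")
graph.relate("paper-B", "cites", "paper-A")

if __name__ == "__main__":
    print(sorted(graph.neighbors("dataset-1")))
```

Here one dataset is reachable from two papers and one paper reaches two datasets, which a single article-to-dataset link cannot express.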
RMap Project Usage Scenarios
• An author submitting a paper creates an ReM.
• A publisher looks up ReMs.
• A journal uses RMap to identify relationships between triggered
content and data.
By connecting publications, data, and researchers, and by
preserving and exposing those connections, RMap aims to enable
new forms of scholarly communication, research, and digital
publishing that will serve emerging needs for a variety of
stakeholders.
Use Case 1 - Author Creates ReM
An author is about to submit a paper to a publisher and a
dataset to a repository or to a publisher and would like to
send an ReM defining the relationship between both research
outputs.
Icon by aha-soft http://goo.gl/ShaJro
Use Case 2 - Publisher Looks up ReMs
At the time of publication, a publisher would like to
determine if there are any existing relationships to an
article, its author, its federal grant, etc. that it can include as
reference links.
Icon by LinhPham.me
Use Case 3 - Triggering Linked Content
A journal archive is about to trigger content and
would like to identify all relationships the articles in
the triggered journal have with other resources.
Icon by iconshock http://goo.gl/Y3Jl7w
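A lookup of this kind could be sketched as below, assuming relationships are available as simple (subject, relation, object) triples. The function name, relation labels, and identifiers are hypothetical; how RMap actually exposes such queries is not described here.

```python
def relationships_for_articles(article_ids, triples):
    """Map each article to the links it takes part in, in either direction."""
    found = {article: [] for article in article_ids}
    for subject, relation, obj in triples:
        if subject in found:
            found[subject].append((relation, obj))
        if obj in found:
            # The article is the target, so report the link from its side.
            found[obj].append(("inverse of " + relation, subject))
    return found


# Hypothetical relationships gathered from several resource maps.
TRIPLES = [
    ("article-1", "hasDataset", "dataset-9"),
    ("article-2", "cites", "article-1"),
    ("article-3", "hasDataset", "dataset-4"),
]

# Articles in the journal being triggered (illustrative identifiers).
links = relationships_for_articles(["article-1", "article-2"], TRIPLES)

if __name__ == "__main__":
    for article, related in links.items():
        print(article, related)
```

Scanning both subject and object positions matters because an article may be the target of a link created by someone else, such as a later paper citing it.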
Use Case 4 - Replicating Research
Researchers can retrieve relevant
datasets associated with published
articles in order to replicate
methodology.
"DNA replication split" by Madprime - Own work. Licensed under Creative
Commons Attribution-Share Alike 3.0 via Wikimedia Commons http://commons.wikimedia.org/wiki/File:DNA_replication_split.svg#mediaviewer/File:DNA_replication_split.svg
Continuing Challenges
• Even with RMap and other projects, challenges remain,
such as:
• Funding for data management
• Quality of data
Conclusion
• Article becomes a “thing”
http://rmap-project.info/
Conclusion
The end of conclusions?
Nancy Kopans