
Fifth Conference on
Open Access Scholarly Publishing
Riga, Latvia
20 September 2013
The Open Citations Corpus
– freeing scholarly citation data
David Shotton
Oxford e-Research Centre and
Department of Zoology
University of Oxford, UK
© David Shotton, 2013
Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence
Scholarly communication today
Scholarly articles haven't really changed much in 346 years
4th Aug 1666
1st Jan 1888
19th March 2012
Scholarly communication – an analogy

Scholarly communication, at this mid-point in the digital revolution, is in an ill-defined transitional state (a ‘horseless carriage’ state) that lies somewhere
between the world of print and paper and the world of the web and computers,
with the former still exercising significantly more influence than the latter

We started here:


We’re now here (online):
Great – that’s a significant start
Scholarly communication – an analogy

. . . but this is really where we need to be!
The importance of citations
What is a citation?

The performative act of citing a published work that is relevant to the current
work, typically made by including a reference in a reference list
Why are citations important?

The act of bibliographic citation is central to scholarly communication –
bibliographic references are the links that knit together independent
scholarship

Citations unify the whole world of scholarship into a giant citation network

Citation networks reveal the development of academic disciplines

Sir Isaac Newton: “If I have seen a little further, it is by standing on the
shoulders of Giants”
How is the present situation imperfect?

The present scholarly citation system inadequately exposes the knowledge
networks that exist within the scholarly literature, linking papers, authors,
funders, research projects and datasets

Citation data are hidden behind subscription firewalls of commercial companies

Academics are not free to use their own citation data as they please

In this Open Access age, it is a scandal that reference lists from journal articles,
the core elements of the academic data cycle, are not freely available for use by
the scholars who created them

Citation data now need to be recognized as a part of the Commons –
those works that are freely and legally available for sharing
Nomenclature and metadata
Current citation practice


Well-formed references in reference lists relate to clearly defined entities

But extreme ambiguity in terminology!
[Figure: several distinct items are each labelled “a reference”]
Recommended nomenclature for references and citations

[Diagram] Within the citing article, a c4o:InTextReferencePointer c4o:denotes a biro:BibliographicReference
The biro:BibliographicReference biro:references the cited article
The citing article cito:cites the cited article

This is the nomenclature used in our SPAR (Semantic Publishing and Referencing) Ontologies
http://purl.org/spar/
Generic structured metadata required to record a citation
entities: Citing paper, Cited paper
type: e.g. Journal article
bibliographic metadata: Title, Publication date, Bibliographic citation, Unique identifier
relationship: cito:cites
provenance: Source of citation info, e.g. CrossRef
The Open Citations Corpus
The original Open Citations Corpus

An open repository of bibliographic citation data created in 2011
 available at http://opencitations.net

Created with JISC funding of the Open Citations Project
 project blog: http://opencitations.wordpress.com/

Originally populated with ~6.4 million individual references from the reference
lists of ~200,000 articles in the Open Access Subset of PubMed Central
(as of January 2011)

These references cite >3 million unique papers
 ~ 20% of all PubMed papers published between 1950 and 2010,
including all the highly cited papers in every biomedical field
 Multiple citations of the same well-cited papers permitted us to perform
error correction of the harvested citations (approx 1% erroneous)

These citations are encoded as Linked Open Data using the SPAR ontologies,
and are freely available under a CC0 waiver from http://opencitations.net/data/
Viewing citation networks at http://opencitations.net
The outward citation network of Reis et al. (2008)
Limitations of the original Open Citations Corpus

A snapshot in time of the citation data in PubMed Central as of January 2011
 becoming increasingly out of date

Contains references from open access articles only

Limited to the biomedical domain
Expanding the Open Citations Corpus
Expanding the Open Citations Corpus - Objectives

Redesign the OCC data model

Update the current ingest

Increase the domain coverage

Include reference lists from subscription-access journals

Harvest references on a continuing basis, as articles are published

Improve the user interface and the user experience

Publish the citation data both in BibJSON and in RDF as Linked Open Data

Build added value services over the citation data
Redesigning the Open Citations Corpus data model

Three record types: Entity Records, Personal Records and Citation Records

A clear separation is made between potentially erroneous citation information
‘as received’ in text strings from article reference lists
 ReferenceTextRecords containing NameTextRecords (of authors, editors)
and authoritative bibliographic metadata derived from trustworthy sources
such as CrossRef, PubMed and the web pages of published articles
 BibliographicRecords and PersonalRecords (of authors, editors)

A distinction is also made between an UnmatchedCitationRecord
 where no BibliographicRecord exists within the OCC for the cited entity
and a MatchedCitationRecord
 where the cited entity has a BibliographicRecord within the OCC

A unique internal identifier is created for each OCC record

Provenance information details the source of each citation, the date it was
acquired, its format, and the name of the curator responsible for its ingestion
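
The record types described above could be sketched as follows. This is only an illustration, not the actual OCC schema: every field name (such as raw_text, fmt or curator) and the use of UUIDs as internal identifiers are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional
import uuid


def new_id() -> str:
    """Hypothetical internal identifier; the real OCC scheme is not specified here."""
    return uuid.uuid4().hex


@dataclass
class Provenance:
    source: str     # where the citation came from, e.g. "PubMed Central"
    acquired: date  # date the citation was acquired
    fmt: str        # format in which it was received
    curator: str    # curator responsible for its ingestion


@dataclass
class NameTextRecord:
    raw_name: str   # author or editor name exactly as received
    id: str = field(default_factory=new_id)


@dataclass
class ReferenceTextRecord:
    raw_text: str   # reference string exactly as received
    names: List[NameTextRecord] = field(default_factory=list)
    id: str = field(default_factory=new_id)


@dataclass
class BibliographicRecord:
    title: str      # authoritative metadata from a trusted source
    year: Optional[int] = None
    doi: Optional[str] = None
    id: str = field(default_factory=new_id)


@dataclass
class UnmatchedCitationRecord:
    citing: BibliographicRecord
    cited_text: ReferenceTextRecord  # no BibliographicRecord exists yet
    provenance: Provenance
    id: str = field(default_factory=new_id)


@dataclass
class MatchedCitationRecord:
    citing: BibliographicRecord
    cited: BibliographicRecord       # cited entity is known to the corpus
    provenance: Provenance
    id: str = field(default_factory=new_id)
```

Keeping the raw text records separate from the authoritative records means a harvested string is never overwritten when a better source becomes available.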
Reconfiguring the Open Citations Corpus

Underlying technical implementation being revised
 Bibliographic information encoded in BibJSON
 Data stored in BibServer, that handles BibJSON natively

Data from different sources brought into a common BibJSON format as soon as
possible

Processing the whole ingest from either source takes over 24 hours

Work still to be done on the ingest pipeline, since the parsing of citation
information from the reference list entries is not yet 100% accurate
Matching citation strings to bibliographic records

When a new reference has been extracted from a reference list
 a ReferenceTextRecord is created for the citation target, and
 an UnmatchedCitationRecord is created between the BibliographicRecord
of the citing paper and the citation target’s ReferenceTextRecord

The ReferenceTextRecord is then compared with existing BibliographicRecords

If a match is found, a new MatchedCitationRecord is created within the OCC
between the BibliographicRecords of the citing and cited entities, and

the pre-existing UnmatchedCitationRecord between the citing and cited entities
is deprecated

Similarly, a new NameTextRecord is created for each author and editor named
in the new ReferenceTextRecord, and the OCC is then searched for matches
to existing PersonalRecords within the OCC
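
The matching steps above can be sketched roughly as below. The comparison rule used here (a normalised title appearing inside the reference string) is an assumption chosen for brevity; the slides do not say how the OCC actually compares records, and the Corpus class and its field names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


def normalise(text: str) -> str:
    """Lower-case and collapse punctuation so trivial differences do not block a match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


@dataclass
class Corpus:
    bibliographic: Dict[str, str]            # BibliographicRecord id -> title
    citations: List[Tuple[str, str, str]]    # (citing id, target, status)

    def find_match(self, reference_text: str) -> Optional[str]:
        """Return the id of a BibliographicRecord whose title occurs in the text."""
        haystack = normalise(reference_text)
        for rec_id, title in self.bibliographic.items():
            if normalise(title) in haystack:
                return rec_id
        return None

    def ingest_reference(self, citing_id: str, reference_text: str) -> str:
        # Step 1: record the citation as unmatched, pointing at the raw text.
        self.citations.append((citing_id, reference_text, "unmatched"))
        # Step 2: compare the raw text with existing BibliographicRecords.
        cited_id = self.find_match(reference_text)
        if cited_id is None:
            return "unmatched"
        # Step 3: deprecate the unmatched record and add a matched one.
        self.citations[-1] = (citing_id, reference_text, "deprecated")
        self.citations.append((citing_id, cited_id, "matched"))
        return "matched"
```

The unmatched record is marked as deprecated rather than deleted, so the original 'as received' text remains traceable after a match is made.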
Citation error correction

Examples of errors in reference list entries vary
 from the trivial – a non-English name with incorrect accents
or an article title containing “beta” instead of the correct “β”
 to the serious – two papers in the same reference list with the same DOI

Such errors can be detected by comparing a new ReferenceTextRecord with
pre-existing BibliographicRecords, and a new NameTextRecord with pre-existing PersonalRecords

Where there are several OCC ReferenceTextRecords referencing the same
multiply-cited paper for which an authoritative OCC BibliographicRecord does
not yet exist, we use voting algorithms for reference disambiguation and error
correction, enabling the creation of a reliable BibliographicRecord for that entity
even when we can find no external authority to provide it
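
A simple form of such voting is a per-field majority vote, sketched below. The slides do not specify which voting algorithms are used, so this is an assumed stand-in; the field names and the sample references are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def vote(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Build a consensus record from the most common value of each field."""
    consensus: Dict[str, str] = {}
    fields = {name for record in records for name in record}
    for name in sorted(fields):
        # Ignore records where the field is missing or empty.
        values = [record[name] for record in records if record.get(name)]
        if values:
            # Ties go to the value that was encountered first.
            consensus[name] = Counter(values).most_common(1)[0][0]
    return consensus


# Three invented versions of the same reference, two of them with one error each.
refs = [
    {"title": "Effect of beta-blockers", "year": "2008", "volume": "12"},
    {"title": "Effect of β-blockers", "year": "2008", "volume": "12"},
    {"title": "Effect of β-blockers", "year": "2003", "volume": "12"},
]
consensus = vote(refs)
```

Here the consensus keeps the "β" title and the year 2008, because each error occurs in only one of the three versions.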

In future, we wish to offer an automated OCC reference correction service to
third parties such as authors and journal editors, enabling them to spot and
correct errors in the reference lists of submitted papers before publication
New relationship types in the Open Citations Corpus
Entity type relationships

The nature of the source entity and the target entity (e.g. journal article, book,
dataset) are separately recorded in the OCC. We can thus infer the nature of
each entity type relationship, for example:
 Article-to-article bibliographic citation
 Article-to-database data citation
 Data_repository-to-article bibliographic citation
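
The inference described above might look like the following sketch. The lookup table from target type to citation kind is an assumption made for this example, not an OCC rule.

```python
from typing import Dict

# Hypothetical mapping from the type of the cited entity to the kind of citation.
CITATION_KIND: Dict[str, str] = {
    "article": "bibliographic citation",
    "book": "bibliographic citation",
    "database": "data citation",
    "dataset": "data citation",
}


def relationship(source_type: str, target_type: str) -> str:
    """Name the entity type relationship from the two separately recorded types."""
    kind = CITATION_KIND.get(target_type, "citation")
    return f"{source_type.capitalize()}-to-{target_type} {kind}"
```

Called with ("article", "database"), this returns "Article-to-database data citation".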
Relationships other than bibliographic citations

Additional relationship types between entities in the OCC may be encoded
using CiTO, the Citation Typing Ontology, if that information is available:
 Citation
:EntityA cito:cites :EntityB .
 Shared authorship
:EntityA cito:sharesAuthorsWith :EntityB .
 Common funding
:EntityA cito:sharesFundingAgencyWith :EntityB .
 Common institution
:EntityA cito:sharesAuthorInstitutionWith :EntityB .
 Related
:EntityA dcterms:relation :EntityB .
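
As a minimal sketch, statements like those above could be generated as Turtle text with plain string formatting. The property names are copied from this slide; the example.org namespace for the entities is a placeholder, and a real pipeline would more likely use an RDF library than hand-built strings.

```python
from typing import Dict, List, Tuple

PREFIXES: Dict[str, str] = {
    "cito": "http://purl.org/spar/cito/",
    "dcterms": "http://purl.org/dc/terms/",
    "": "http://example.org/entity/",  # placeholder namespace, for illustration only
}

# Relationship label from the slide -> property used to encode it.
PROPERTIES: Dict[str, str] = {
    "Citation": "cito:cites",
    "Shared authorship": "cito:sharesAuthorsWith",
    "Common funding": "cito:sharesFundingAgencyWith",
    "Common institution": "cito:sharesAuthorInstitutionWith",
    "Related": "dcterms:relation",
}


def to_turtle(statements: List[Tuple[str, str, str]]) -> str:
    """Serialise (subject, relationship label, object) tuples as Turtle."""
    header = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    body = [f":{s} {PROPERTIES[label]} :{o} ." for s, label, o in statements]
    return "\n".join(header + [""] + body)


turtle = to_turtle([
    ("EntityA", "Citation", "EntityB"),
    ("EntityA", "Common funding", "EntityB"),
])
```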
Expansion of the Open Citations Corpus coverage

Ingest from the Open Access Subset of PubMed Central is being
updated from ~200,000 articles in Jan 2011 to the current ~658,000
articles in September 2013

Domain coverage is being expanded to include the physical sciences
and mathematics, by the ingest of the reference lists from all ~872,000
preprints in the arXiv preprint repository at Cornell University Library

This will bring the total number of references from ~6.4 million to
~40 million

We then intend to ingest all the references in CiteSeer and from
Wikipedia, marking these with clear provenance information

To this we will add citations from data repositories such as Dryad, that
contain literature references associated with the datasets they hold

and from DataCite, that issues DOIs for datasets, and harvests
metadata that contain literature references
Citations from heritage literature – ‘The Future of the Past’

Funding application just submitted to harvest references from the pre-digital
biodiversity / biological taxonomy literature, where papers have lasting value

We will use the Biodiversity Heritage Library (http://www.biodiversitylibrary.org/)
as a source of references

David King, a text mining colleague at the Open University, will use advanced
text mining techniques to dig references out of ‘dirty’ OCR’d page images

We will then ingest these data into the Open Citations Corpus and make them
freely available

This will be the only source of digital citation data from a major fraction of the
world's heritage literature in the field of biodiversity / biological taxonomy, that is
simply not available in digital form anywhere else
Additional citations from PubMed Central

There are ~2.2 million articles in PubMed Central that are not part of the Open
Access Subset, presently missing from the Open Citations Corpus

These contain citations not only to other papers, but also to datasets, typically in
the form of database accession numbers, buried within the full text or footnotes

Recent text mining initiatives undertaken by Europe PubMed Central (EPMC)
have extracted both the bibliographic citations and the data citations from all
~2.8 million PubMed Central articles, which are now freely available

We propose to ingest all these EPMC literature and data citations into the
expanded and improved Open Citations Corpus
 This will increase the number of PMC articles for which the OCC holds
citation information by about 330%
 In addition, it will further expand the nature of the citation data held to
include the data citations contained within these PMC articles
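
The "about 330%" figure can be checked against the article counts quoted in this talk: ~658,000 articles in the Open Access Subset and ~2.2 million PMC articles outside it. The variable names below are illustrative only.

```python
oa_subset = 658_000   # articles in the Open Access Subset, September 2013
non_oa = 2_200_000    # PMC articles outside the Open Access Subset


def percent_increase(before: int, added: int) -> float:
    """Percentage by which `before` grows when `added` items are included."""
    return 100.0 * added / before


increase = percent_increase(oa_subset, non_oa)
print(f"{increase:.0f}%")  # prints 334%
```

The result is consistent with the rounded figure on the slide, given that both article counts are themselves approximate.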

However, these are just a fraction of the total scholarly citations, most of which
are locked behind the pay walls of commercial providers
Reference lists from subscription–access articles

All fully open access publishers already publish article reference lists openly

I am working to persuade other major scholarly publishers to do the same
 i.e. to put article reference lists outside the subscription pay-wall, in the
same way as abstracts and bibliographic metadata are freely available

Last January, I published an Open Letter to Publishers requesting this
 Claire Redhead kindly distributed it to all OASPA members
 The letter is available at
http://imageweb.zoo.ox.ac.uk/pub/2013/letters/Letter_to_all_scholarly_journal_publishers_re_open_citations.pdf

A number of leading STM publishers have expressed their willingness to open
the reference lists from subscription-access journal articles
 Nature, Science, Taylor & Francis, Royal Society Publishing, Portland
Press, MIT Press and Oxford University Press are among the first
 another has expressed willingness verbally, but has yet to commit formally
http://opencitations.wordpress.com
Opening article reference lists via CrossRef

How can these be ingested into the Open Citations Corpus? Most publishers
already submit their reference lists to CrossRef as part of its CitedBy Linking
Service
 If you do not at present, you should use this free service!

With the publisher’s permission, CrossRef can enable reference lists to be ‘opened’
 on a publisher-by-publisher basis based on DOI prefixes
 on a journal-by-journal basis
 on an article-by-article basis for hybrid journals

References are then available via the CrossRef API for ingest into the OCC
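
As a hedged sketch of what such an ingest step might look like, the code below pulls references out of a JSON work record. It assumes the response shape of the present-day Crossref REST API (a "message" object with an optional "reference" list), which postdates this talk and may differ from the interface meant here. The sample record and its DOIs are invented, and no network request is made.

```python
import json
from typing import Dict, List

# Invented sample in the assumed response shape; the DOIs are not real articles.
SAMPLE = """
{"status": "ok",
 "message": {
   "DOI": "10.5555/citing.example",
   "reference": [
     {"key": "ref1", "DOI": "10.5555/cited.example",
      "unstructured": "Author A (2008) An example title."},
     {"key": "ref2",
      "unstructured": "Author B (1999) A reference with no DOI."}
   ]}}
"""


def extract_references(work_json: str) -> List[Dict[str, str]]:
    """Turn one work record into citation dictionaries with provenance."""
    message = json.loads(work_json).get("message", {})
    citing = message.get("DOI", "")
    results = []
    # The "reference" key is absent when a reference list is not open.
    for ref in message.get("reference", []):
        results.append({
            "citing_doi": citing,
            "cited_doi": ref.get("DOI", ""),
            "reference_text": ref.get("unstructured", ""),
            "source": "CrossRef",  # provenance of the citation information
        })
    return results


references = extract_references(SAMPLE)
```

References that arrive without a DOI keep only their text string, so they would go through the string matching step rather than being linked directly.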

However, because the default CrossRef CitedBy Linking Service agreement is not
to publish reference lists, even Open Access publishers must
specifically inform CrossRef that the reference lists of their
journal articles should be open

Geoff Bilder has a new CrossRef Metadata Best Practice Document that I will
circulate, explaining how to specify this choice in your article metadata
Summary - Benefits of the Open Citations Corpus

Created by scholars for scholars using scholarly data
 No profit motive constraining free publication of the data
 Will bring particular benefit to those who are NOT members of First World
academic institutions whose libraries subscribe to commercial citation
data from Thomson-Reuters or Elsevier

Will provide integrated access to citation data from a variety of sources, both
inside and outside traditional scholarly publishing, with provenance information

Data are semantically described using the SPAR bibliographic ontologies
 Citations thus become part of the Web of Linked Open Data

Data available in a variety of formats including BibJSON, BibTeX and RDF for
download by third parties for their own use or to build into cool services
 indexing, search and browse (in prototype)
 timeline visualizations (in prototype)
 analysis of citation networks, co-authorship networks, etc.
 trend identification, recommendation services, etc.
Sustainability

The development of the Open Citations Corpus has been enabled by short-term grant funding, but this does not provide a sustainable financial model

For the future, we seek one of the following long-term arrangements:
 Adoption by a major institutional or national library
 Adoption by a publishing organization such as CrossRef, with indirect
support from publishers
 Direct support by the scholarly publishing community
 Social investment, i.e. the provision of capital to generate social as well
as financial returns, to support open access to scholarly information
 Income support by charging for added-value services over the open data

I would be grateful for your views on the value of the Open Citations Corpus
and the manner in which its ongoing development might be supported
Acknowledgements and thanks

Alex Dutton, who developed the original Open Citations Corpus

Richard Jones, Martyn Whitwell and Mark MacGillivray of Cottage
Labs, who have undertaken more recent development work

Silvio Peroni, my colleague in developing the suite of SPAR
(Semantic Publishing and Referencing) Ontologies

The JISC, who have funded the development of the Open
Citations Corpus