Preservation - Prelida project


Digital
Preservation
DAVID GIARETTA (APA)
FIRST PRELIDA WORKSHOP, TIRRENIA, JUNE 25TH–27TH, 2013
Outline
Fundamental demands
Fundamental concepts
Trust
OAIS and Linked Data
Fundamental
demands
Preservation and value
Who pays?
Why?
What to preserve?
What value?
Examples
Books
Web
Science data
What are the differences?
Value
RIDING THE WAVE
Vision 2030
(2) Researchers and practitioners from any discipline
are able to find, access and process the data they
need. They can be confident in their ability to use
and understand data and they can evaluate the
degree to which the data can be trusted.
• Create a robust, reliable, flexible, green, evolvable data framework with appropriate governance and long-term funding schemes to key services such as Persistent Identification and registries of metadata.
• Propose a directive demanding that data descriptions and provenance are associated with public (and other) data.
• Create a directive to set up a unified authentication and authorisation system.
• Set Grand Challenges to aggregate domains.
• Provide “forums” to define strategies at disciplinary and cross-disciplinary levels for metadata definition.
IMPACT IF ACHIEVED
• Dramatic progress in the efficiency of the scientific process, and rapid advances in our understanding of our complex world, enabling the best brains to thrive wherever they are.
Vision 2030
(3) Producers of data benefit from opening it to broad access and prefer to deposit their data with confidence in reliable repositories. A framework of repositories works to international standards, to ensure they are trustworthy.
• Propose reliable metrics to assess the quality and impact of datasets. All agencies should recognise high quality data publication in career advancement.
• Create instruments so long-term (rolling) EU and national funding is available for the maintenance and curation of significant datasets.
• Help create and support international audit and certification processes.
• Link funding of repositories at EU and national level to their evaluation.
• Create the discipline of data scientist, to ensure curation and quality in all aspects of the system.
IMPACT IF ACHIEVED
• Data-rich society with information that can be used for new and unexpected purposes.
• Trustworthy information is useable now and for future generations.
Vision 2030
(4) Public funding rises, because funding bodies have confidence that their investments in research are paying back extra dividends to society, through increased use and re-use of publicly generated data.
• EU and national agencies mandate that data management plans be created.
IMPACT IF ACHIEVED
• Funders have a strategic view of the value of data produced.
Vision 2030
(6) The public has access and can make creative use of the huge amount of data available; it can also contribute to the data store and enrich it. All can be adequately educated and prepared to benefit from this abundance of information.
• Create non-specialist as well as specialist data access, visualisation, mining and research environments.
• Create annotation services to collect views and derived results.
• Create data recommender systems.
• Embed data science in all training and academic qualifications.
• Integrate into gaming and social networks.
IMPACT IF ACHIEVED
• Citizens get a better awareness of and confidence in sciences, and can play an active role in evidence-based decision making and can question statements made in the media.
Vision 2030
(7) Policy makers can make decisions based on solid evidence, and can monitor the impacts of these decisions. Government becomes more trustworthy.
IMPACT IF ACHIEVED
• Policy decisions are evidence-based to bridge the gap between society and decision-making, and increase public confidence in political decisions.
Fundamental
concepts
OAIS
Digital Preservation…
Easy to do…
…as long as you can provide money forever
Easy to test claims about repositories…
…as long as you live a long time
Preservation techniques
For each technique
◦ look for evidence – what evidence?
must at least make sure we consider different types of data:
◦ rendered vs non-rendered
◦ composite vs simple
◦ dynamic vs static
◦ active vs passive
must look at all types of threats
Threats
Things change…
Things become unfamiliar
◦ Hardware
◦ Software
◦ Environment
◦ Tacit knowledge
Problems when preserving data
Preserve?
Preserve what?
For how long?
How to test?
Which people?
Which organisations?
How well?
Metadata? – What kind? How much?
Requirements
rising tide of data…
“A fundamental characteristic of our age is the rising tide of data – global, diverse, valuable and complex. In the realm of science, this is both an opportunity and a challenge.”
Report of the High-Level Group on Scientific Data, October 2010
“Riding the Wave: how Europe can gain from the rising tide of scientific data”
Who pays?
Why?
Rising tide of data…
Opportunities
Data contains numbers etc – need meaning
[Figure: raw data (Level 0) is processed and combined into higher-level products (Level 1, Level 2) to get meaning.]
Preserving digitally encoded
information
Ensure that digitally encoded information is understandable and usable over the long term
◦ Long term could start at just a few years
◦ Chain of preservation
Need to do something because things become
“unfamiliar” over time
But the same techniques enable use of data
which is “unfamiliar” right now
Lots of useful terminology
[Figure: OAIS Functional Model. The PRODUCER submits Submission Information Packages (SIPs) to Ingest, which generates Archival Information Packages (AIPs) for Archival Storage and Descriptive Information for Data Management. Administration and Preservation Planning oversee the archive under MANAGEMENT. The CONSUMER sends queries and orders to Access, receives query responses, and obtains Dissemination Information Packages (DIPs).]
Key OAIS Concepts
Claiming “This is being preserved” is untestable
◦ Essentially meaningless
◦ Except “BIT PRESERVATION”
How can we make it testable?
◦ Claim to be able to continue to “do something” with it
◦ Understand/use
◦ Need Representation Information
Still meaningless…
◦ Things are too interrelated
◦ Representation Information potentially unlimited
◦ Need to define a Designated Community – those we guarantee can
understand – so we can test
OAIS Information model:
Representation Information
The Information Model is key.
Recursion ends at the KNOWLEDGEBASE of the DESIGNATED COMMUNITY (this knowledge will change over time and region).
Does not demand that ALL Representation Information be collected at once.
A process which can be tested.
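The recursion just described can be sketched as a graph walk that stops whenever an item is already in the Designated Community's knowledge base. This is only an illustration: the network below and all names in it (the GOCE-style file and its dependencies) are hypothetical, not an actual registry.

```python
# Sketch: resolving a Representation Information (RepInfo) network until
# every item is understandable by the Designated Community. The OAIS
# recursion "RepInfo is itself interpreted using RepInfo" terminates when
# a node is already in the community's knowledge base.

def resolve_repinfo(obj, repinfo_network, knowledge_base):
    """Return the set of RepInfo items that must be preserved alongside
    `obj` so the Designated Community can understand it."""
    needed, stack, seen = set(), [obj], set()
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        for dep in repinfo_network.get(item, []):
            if dep in knowledge_base:   # recursion ends here (OAIS)
                continue
            needed.add(dep)
            stack.append(dep)           # RepInfo needs its own RepInfo
    return needed

# Illustrative network, loosely modelled on the GOCE example later on
network = {
    "goce_level1.n1": ["N1 file description"],
    "N1 file description": ["XML", "N1 dictionary"],
    "N1 dictionary": ["PDF"],
}
kb = {"XML", "PDF"}                     # what the community already knows
print(resolve_repinfo("goce_level1.n1", network, kb))
```

Note how shrinking the knowledge base (a future Designated Community that no longer knows PDF) automatically enlarges the set of RepInfo that must be collected — this is what makes the process testable.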
Representation Network
[Figure: Representation Network for GOCE data. GOCE Level 1 data (N1 file format) is described by a GOCE N1 file description (in XML), which points to the GOCE N1 file standard and the GOCE N1 file Dictionary; the Dictionary specification is a PDF, interpreted using PDF software according to the PDF standard. GOCE Level 0 data and the GOCE Level 0 Processor Algorithm are also part of the network.]
[Figure: OAIS Information Model. An Archival Information Package contains Content Information and Preservation Description Information, is further described by a Package Description (derived from it), and is delimited by Packaging Information, which identifies the Content Information. Content Information is a Data Object – a Physical Object or a Digital Object, the latter composed of one or more Bits – interpreted using Representation Information. Representation Information consists of Structure Information, Semantic Information (which adds meaning to the structure) and Other Representation Information, and is itself interpreted using further Representation Information. Preservation Description Information comprises Reference, Provenance, Context, Fixity and Access Rights Information, and itself has Representation Information.]
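The package structure above can be sketched as plain data classes. This is a minimal illustration with hypothetical field names and values, not a normative OAIS schema:

```python
# Sketch of the OAIS Archival Information Package: Content Information
# (a Data Object plus its Representation Information) and Preservation
# Description Information, bound together by Packaging Information.
from dataclasses import dataclass


@dataclass
class ContentInformation:
    data_object: bytes                 # the bits being preserved
    representation_info: list          # structure + semantic + other RepInfo


@dataclass
class PreservationDescriptionInformation:
    reference: str                     # e.g. a persistent identifier
    provenance: list                   # who did what, when
    context: str
    fixity: str                        # e.g. a SHA-256 digest
    access_rights: str


@dataclass
class ArchivalInformationPackage:
    content: ContentInformation
    pdi: PreservationDescriptionInformation
    packaging_info: str = "zip"        # how the parts are bound together


# Illustrative instance (all values hypothetical)
aip = ArchivalInformationPackage(
    content=ContentInformation(b"\x00\x01", ["N1 file description"]),
    pdi=PreservationDescriptionInformation(
        reference="hdl:10.9999/goce-l1",
        provenance=["ingested 2013-06-25"],
        context="GOCE mission",
        fixity="sha256:...",
        access_rights="open"),
)
```

The point of the structure is that everything needed to understand, verify and trust the Data Object travels with it, rather than living only in the repository's head.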
When things change
We need to:
◦ Know something has changed
◦ Identify the implications of that change
◦ Decide on the best course of action for preservation
◦ What RepInfo we need to fill the gaps
◦ Created by someone else or creating a new one
◦ If transformed: how to maintain data authenticity
◦ Alternatively: hand it over to another repository
◦ Make sure data continues to be usable
Transformation
Change the format, e.g.
◦ Word → PDF/A (NB PDF/A does not support macros)
◦ GIF → JPEG2000 (resolution / colour depth…)
◦ Excel table → FITS file (NB FITS does not support formulae)
Transformational Information Properties
◦ Old EO or proprietary format → HDF
◦ Certainly need to change STRUCTURE RepInfo
◦ May need to change SEMANTIC RepInfo
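The idea of a Transformational Information Property is that some chosen property of the data must provably survive the format change, even though other things (macros, formulae) may not. A minimal sketch, using CSV → JSON as a stand-in for the migrations above:

```python
# Sketch: checking a Transformational Information Property across a format
# migration. The chosen property here is "the cell values themselves" --
# they must be identical before and after, even though layout or formulae
# would be lost.
import csv
import io
import json


def csv_to_json(csv_text):
    """Migrate a CSV table to a simple JSON representation."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps({"header": rows[0], "rows": rows[1:]})


def property_values(csv_text):
    """The Transformational Information Property: the flat list of cells."""
    return [cell for row in csv.reader(io.StringIO(csv_text)) for cell in row]


original = "a,b\n1,2\n3,4\n"
migrated = csv_to_json(original)

# Verify the property survived the transformation
decoded = json.loads(migrated)
flat = [c for r in [decoded["header"], *decoded["rows"]] for c in r]
assert flat == property_values(original)
```

Which property to fix is a preservation decision: for an image it might be pixel values at a given resolution; for a spreadsheet, the evaluated cell values rather than the formulae.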
Hand-over
Preservation requires funding
Funding for a dataset (or a repository) may stop
Need to be ready to hand over everything
needed for preservation
◦ OAIS (ISO 14721) defines the “Archival Information Package” (AIP)
◦ Issues:
◦ Storage naming conventions
◦ Representation Information
◦ Provenance
◦ Identifiers
◦ ….
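A hand-over can be sketched as assembling one self-describing manifest so that data, fixity, RepInfo pointers, provenance and identifier all travel together to the successor repository. The layout and names below are illustrative only, not a SCIDIP-ES or standard packaging format:

```python
# Sketch: everything a successor repository needs, packaged together,
# following the AIP idea from OAIS (ISO 14721).
import hashlib
import json


def build_handover_manifest(files, repinfo_ids, provenance, identifier):
    """Build a self-describing manifest for a repository hand-over."""
    return {
        "identifier": identifier,                     # persistent identifier
        "files": {name: hashlib.sha256(data).hexdigest()
                  for name, data in files.items()},   # fixity per file
        "representation_information": repinfo_ids,    # pointers to a registry
        "provenance": provenance,                     # chain of evidence
    }


# Illustrative hand-over (all values hypothetical)
manifest = build_handover_manifest(
    files={"goce_l1.n1": b"\x00\x01"},
    repinfo_ids=["reg:n1-file-description"],
    provenance=["produced by GOCE L0 processor", "ingested 2013-06-25"],
    identifier="hdl:10.9999/goce-l1",
)
print(json.dumps(manifest, indent=2))
```

The receiving repository can verify the fixity digests on arrival and resolve the RepInfo pointers, so nothing essential is left behind in the ceasing repository's local conventions.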
Threats and solutions

Threat: Users may be unable to understand or use the data, e.g. the semantics, format, processes or algorithms involved.
Requirement for solution: Ability to create and maintain adequate Representation Information.
Solution: RepInfo toolkit, Packager and Registry – to create and store Representation Information. In addition the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate.

Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible.
Requirement for solution: Ability to share information about the availability of hardware and software and their replacements/substitutes.
Solution: Registry and Orchestration Manager to exchange information about the obsolescence of hardware and software, amongst other changes. The Representation Information will include such things as software source code and emulators.

Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity.
Requirement for solution: Ability to bring together evidence from diverse sources about the Authenticity of a digital object.
Solution: Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.

Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future.
Requirement for solution: Ability to deal with Digital Rights correctly in a changing and evolving environment.
Solution: Packaging toolkit to package access rights policy into the AIP.

Threat: Loss of ability to identify the location of data.
Requirement for solution: An ID resolver which is really persistent.
Solution: Persistent Identifier system: such a system will allow objects to be located over time.

Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future.
Requirement for solution: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation.
Solution: Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.

Threat: The ones we trust to look after the digital holdings may let us down.
Requirement for solution: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term.
Solution: Certification toolkit to help the repository manager capture evidence for ISO 16363 Audit and Certification.
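The point of an ID resolver is that identity is decoupled from location: users keep the persistent identifier while the resolver's table is updated whenever the data moves to a new custodian. A minimal sketch (class and URLs are illustrative, not a real Handle or DOI service):

```python
# Sketch: a persistent-identifier resolver. The PID never changes; only
# the resolver's mapping from PID to current location is updated.

class Resolver:
    def __init__(self):
        self.table = {}

    def register(self, pid, url):
        """Record (or update) the current location of an object."""
        self.table[pid] = url

    def resolve(self, pid):
        """Return the current location for a persistent identifier."""
        return self.table[pid]


r = Resolver()
r.register("hdl:10.9999/goce-l1", "https://archive-a.example/goce-l1")
# the dataset is handed over to another repository; the PID stays the same
r.register("hdl:10.9999/goce-l1", "https://archive-b.example/goce-l1")
assert r.resolve("hdl:10.9999/goce-l1").startswith("https://archive-b")
```

Of course the hard part is not the lookup table but keeping the resolver itself funded and governed over decades — which is exactly the "really persistent" requirement above.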
Infrastructure support
SCIDIP-ES
◦ Converting CASPAR prototypes into robust services
Trust
Issue / Vision 2030 / Short / Medium / Long

authenticity of data — Scientists can establish the authenticity of the data they use.
• Short: Standardised system for provenance and related evidence in repositories.
• Medium: Adoption of machine-readable provenance in major repositories; capture of reputation of producers and holders (see Social networking); standardised way to capture reputation of data producers and holders.
• Long: 80% of repositories of scientific data have adequate machine-readable evidence; automated ways to evaluate evidence of authenticity.

validity of data — Users and systems will be able to evaluate the reputation of the data they use.
• Short: Annotation system for datasets, with efforts to formalise annotation language.
• Medium: Ranking system to allow systems to produce rankings of levels of trust (akin to PageRank but based on reputation rather than links).
• Long: Systems can choose datasets which are most trustworthy and can evaluate the risks involved in using less trusted data.

certification of repositories — People can make a judgement about which repositories can be trusted.
• Short: International system of repository certification created.
• Medium: Certification demanded by EU and national funders.
• Long: 80% of major repositories of scientific data are certified.

global trust issues — Users can deal with global datasets with the same confidence as European sources.
• Short: Discussions with US, China, etc.
• Medium: MOU with international agencies on common standards.
• Long: International agreement so that users have evidence of authenticity for world-wide scientific data.

complexity of the system — People can trust that the ever more complex tangle of systems is doing the right thing.
• Short: Simplify interfaces and entanglement.
• Medium: Move towards autonomic, self-configuring, self-healing, self-optimising and self-protecting systems, with appropriate monitoring.
• Long: Systems have survived many generations of changes in technologies and architectures.
What could jeopardise the vision
Lack of long term investment
in critical components such as
persistent identification
Lack of preparation
Lack of willingness to cooperate across disciplines/
funders/ nations
Lack of published data
Lack of trust
Not enough data experts
The infrastructure is not used
Too complex to work
Lack of coherent data
description allowing re-use of
data
Counter by:
Reality check
Identify new funding mechanisms
Identify new sources of funding
Identify risks and benefits associated with digitally encoded information
Ensure the required research is done in advance
Apply subsidiarity principle so we do not step on researchers’ toes
Take advantage of growing need of integration: within and across disciplines
Provide ways for data producers to benefit from publishing their data
Need ways of managing reputations
Need ways of auditing and certifying repositories
Need quality, impact, and trust metrics for datasets
Need to train data scientists and to make researchers aware of the
importance of sharing their data
Work closely with real users and build according to their requirements
Make data use interesting – for example integrating into games
Use “data recommender” systems i.e. “you may also be interested in...”
Do not aim for a single top down system
Ensure effective governance and maintenance system (c.f. IETF)
Provide “forums” to define strategies at disciplinary and cross-disciplinary
levels for metadata definition
From Riding the Wave
Trust issues
Has it been preserved properly?
Is it of high quality?
Has it been changed in some way?
Does the pointer get me to the right object?
Has it been preserved properly?
Can the repository be trusted?
Certification of various kinds
ISO 16363 certification should be available soon
• Judged on the basis of evidence collected and examined
Is it of good quality?
More than one in ten scientists and doctors claim to have
witnessed colleagues deliberately fabricating data in order to
get their research published, a new poll has revealed.
The survey of almost 2,800 experts in Britain also found six
per cent knew of possible research misconduct at their own
institution that has not been properly investigated.
The poll was conducted for the hugely respected British Medical Journal (BMJ).
http://www.dailymail.co.uk/sciencetech/article2085814/Scientists-falsify-data-research-publishedwhistleblowers-bullied-keeping-quiet-claim-colleagues.html
Dirk Smeesters had spent several years of his career as a social psychologist at
Erasmus University in Rotterdam studying how consumers behaved in
different situations. Did colour have an effect on what they bought? How did
death-related stories in the media affect how people picked products? And
was it better to use supermodels in cosmetics adverts than average-looking
women?
The questions are certainly intriguing, but unfortunately for anyone wanting
truthful answers, some of Smeesters' work turned out to be fraudulent. The
psychologist, who admitted "massaging" the data in some of his papers,
resigned from his position in June after being investigated by his university,
which had been tipped off by Uri Simonsohn from the University of
Pennsylvania in Philadelphia. Simonsohn carried out an independent analysis
of the data and was suspicious of how perfect many of Smeesters' results
seemed when, statistically speaking, there should have been more variation
in his measurements.
Dutch psychologist Diederik Stapel. He was found to have fabricated data for
years and published it in at least 30 peer-reviewed papers, including a report
in the journal Science about how untidy environments may encourage
discrimination.
http://www.guardian.co.uk/science/2012/sep/13/scientific-research-fraudbad-practice
Peer review of data
…is difficult
Lessons from APARSEN
Data Quality
Cost Models for preservation
Preservation tools
Preservation services
Has it been changed in some
way?
OAIS defines Authenticity as: The degree to
which a person (or system) regards an object
as what it is purported to be. Authenticity is
judged on the basis of evidence.
Need to capture evidence –what evidence?
Authenticity evidence
Authenticity Model
Provenance capture
◦ How to deal with combinations of data
◦ How to deal with changes
Security and tampering with logs
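One common way to make such logs tamper-evident is hash chaining: each provenance entry embeds the hash of the previous one, so altering any past event invalidates every later hash. A minimal sketch of the idea, not the Authenticity toolkit itself:

```python
# Sketch: a tamper-evident provenance log using a hash chain.
import hashlib
import json


def append_event(log, event):
    """Append an event, chaining it to the hash of the previous entry."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"event": event, "prev": prev}, sort_keys=True)
    log.append({"event": event, "prev": prev,
                "hash": hashlib.sha256(body.encode()).hexdigest()})


def verify(log):
    """Recompute the chain; any edited entry breaks verification."""
    prev = "0" * 64
    for entry in log:
        body = json.dumps({"event": entry["event"], "prev": prev},
                          sort_keys=True)
        if entry["prev"] != prev or \
           entry["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True


log = []
append_event(log, "ingested from producer")
append_event(log, "transformed GIF -> JPEG2000")
assert verify(log)
log[0]["event"] = "nothing happened"      # tampering...
assert not verify(log)                    # ...is detected
```

This only proves internal consistency; anchoring the latest hash somewhere external (a register, a certificate, another repository) is what stops an attacker from rewriting the whole chain.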
OAIS and Linked
Data
Linked Open Data: Issues
• Links – just another dataset?
• Or do we have to view it as part of a huge “cloud”?
• Is that cloud just another dataset?
• Is it just like archiving snapshots of the Web?
• Snapshots? But at different times across the cloud
• HTTP URIs – how persistent?
• HTTP – how persistent?
• RDF – how persistent?
• What do the links mean?
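The snapshot question can be made concrete: because different parts of the Linked Data cloud are fetched at different times, a preserved "snapshot" is really a set of (triple, fetch-time) pairs, not the state of the cloud at one instant. A sketch with plain tuples (no RDF library; all URIs illustrative):

```python
# Sketch: time-stamped preservation of Linked Data triples, so the archive
# records *when* each statement was observed across the cloud.
from datetime import datetime, timezone

archive = []


def preserve_triple(s, p, o, fetched_at):
    """Record a triple together with the time it was fetched."""
    archive.append({"triple": (s, p, o), "fetched_at": fetched_at})


def as_of(when):
    """Triples whose capture time is not later than `when`."""
    return [e["triple"] for e in archive if e["fetched_at"] <= when]


preserve_triple("ex:dataset1", "ex:linksTo", "ex:dataset2",
                datetime(2013, 6, 25, tzinfo=timezone.utc))
preserve_triple("ex:dataset2", "ex:status", "ex:gone",
                datetime(2013, 6, 27, tzinfo=timezone.utc))

assert as_of(datetime(2013, 6, 26, tzinfo=timezone.utc)) == \
       [("ex:dataset1", "ex:linksTo", "ex:dataset2")]
```

Two triples fetched two days apart may already contradict each other — which is why "a snapshot of the cloud" is a harder notion than a snapshot of a single website.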
OAIS-related issues
• Designated community
• Representation Information
• Provenance
• Rights
• Authenticity
• Trustability
• Is it easier to “poison” the system?
OAIS / Linked Data questions
Can OAIS concepts be applied to
the preservation of Linked Data?
Do existing concepts apply?
Are new concepts needed?
What new terminology is needed?
END
QUESTIONS?