Open data and data curation

Hamish James
Statistics New Zealand
Outline
1. Setting the scene
2. Open data
3. How open data and data curation are related
Quick definitions
[Diagram labels: information, data, structured, unstructured, digital, analogue, open data, data curation]
Defining data
Data consists of sets of structured values that
can be organised, analysed and manipulated by a
software application or some other means of
calculation. This includes data collected directly
through surveys and administrative systems, as well as
data created or compiled by aggregating or
reanalysing other sources. A defining characteristic of
data is that it is machine-readable.
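
As a rough illustration of "machine-readable", the sketch below reads a small set of structured values and aggregates them in software. The column names and figures are hypothetical, invented only for this example.

```python
import csv
import io

# Hypothetical structured data: named columns, one record per line.
# The regions and counts are made up for illustration.
RAW = """region,year,population
North,2006,1200
South,2006,800
"""


def total_population(text: str) -> int:
    """Sum the 'population' column of CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return sum(int(row["population"]) for row in reader)


print(total_population(RAW))
```

Because the values are structured, the same records could as easily be filtered, regrouped or combined with another source.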
Open data, data curation


Open data is a philosophy based on the idea
that data is more valuable if more people can use it,
and that technology has made the cost of sharing
data negligible
Data curation is a field of research and work
focusing on the long-term management of data,
built on the argument that the opportunity cost of
losing data is high

Open data highlights benefits

Data curation worries about costs
[Diagram: data, knowledge, value]
Focus of open data activities
• Data collected and held by governments
• Data collected or generated through publicly
funded research
• http://wiki.opengovdata.org/index.php?title=OpenDataPrinciples
Reasons to make data open
• The underlying purposes of making publicly
funded data more accessible are to:
  • inform decision making by government, businesses and communities
  • increase transparency and accountability in government decision making
  • assist informed participation by the public in government decision making
  • promote economic development through the innovative application of data collected for one purpose to other tasks
  • gain greater value from research data
Barriers to reuse of government data

Agency culture (reluctance or hostility to data
sharing)

Funding constraints

Ensuring data confidentiality

Shared ownership

Poor dissemination practices
Open Government Data Principles
• Government data shall be considered open if it is made
public in a way that complies with the principles below:
1. Complete. All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.
2. Primary. Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.
3. Timely. Data is made available as quickly as necessary to preserve the value of the data.
4. Accessible. Data is available to the widest range of users for the widest range of purposes.
5. Machine processable. Data is reasonably structured to allow automated processing.
6. Non-discriminatory. Data is available to anyone, with no requirement of registration.
7. Non-proprietary. Data is available in a format over which no entity has exclusive control.
8. License-free. Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.
Characteristics of open data
Open data:

Free and open access to the data

Freedom to redistribute the data

Freedom to reuse the data

No restriction of the above based on who someone
is (e.g. their nationality) or their field of endeavour
(e.g. commercial or non-commercial)
c.f. http://www.okfn.org/about/
Creative Commons licence conditions
Attribution
Share-alike
No derivative works
Non-commercial
Creative Commons
Linked data
• Linked data uses semantic web approaches
(especially RDF) to describe data and make it
accessible to machines – a web of linked data
• RDF ‘triples’ are used to describe things
  • Subject – predicate – object
  • Hamish – is a – presenter
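
The subject, predicate, object pattern can be sketched in a few lines of Python. This is a simplification: real RDF identifies things with URIs, whereas plain strings are used here, and the `Triple` type and `objects_of` helper are hypothetical names chosen for this example. The second triple is an assumption drawn from the title slide.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# First triple is the slide's example; the second is an illustrative
# assumption based on the presenter's affiliation on the title slide.
TRIPLES = [
    Triple("Hamish", "is a", "presenter"),
    Triple("Hamish", "works at", "Statistics New Zealand"),
]


def objects_of(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects_of(TRIPLES, "Hamish", "is a"))
```

A query is then just a pattern match over the triples, which is what lets machines follow links between datasets.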
Linking Open Data dataset cloud
What is missing?
Data needs context
[Diagram: the value 46 annotated with its context: age, of a person, in years, Census 2006, as at 7 March 2006]
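
One way to picture the age example is a record that carries its context alongside the bare value. The sketch below is only illustrative: the class and field names are assumptions, not taken from any metadata standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextualValue:
    """A value plus the context needed to interpret it (hypothetical fields)."""
    value: int
    concept: str
    unit_of_observation: str
    unit_of_measure: str
    source: str
    reference_date: str


# The value and its context as shown on the slide.
age = ContextualValue(
    value=46,
    concept="Age",
    unit_of_observation="person",
    unit_of_measure="years",
    source="Census 2006",
    reference_date="2006-03-07",
)


def describe(v: ContextualValue) -> str:
    """Render the value together with its context as one sentence."""
    return (f"{v.concept} of a {v.unit_of_observation}: {v.value} "
            f"{v.unit_of_measure} ({v.source}, as at {v.reference_date})")


print(describe(age))
```

Without the extra fields, `46` alone could mean almost anything.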
Examples



“Which town or city in the UK has the highest
proportion of students?”
“Which town or city in the UK is home to one or
more university campuses whose registered full
or part time (non-distance) students divided by
the local population gives the largest
percentage?”
http://digitalcuration.blogspot.com/2010/03/linked-data-and-reality.html
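
The more precise version of the question translates directly into a calculation. The sketch below assumes a hypothetical table of towns; the names and figures are invented placeholders, not real statistics.

```python
from typing import Dict, List, Tuple

# Hypothetical figures for illustration only.
TOWNS = [
    {"town": "Town A", "population": 100_000, "non_distance_students": 20_000},
    {"town": "Town B", "population": 50_000, "non_distance_students": 15_000},
    {"town": "Town C", "population": 200_000, "non_distance_students": 0},
]


def highest_student_share(towns: List[Dict]) -> Tuple[str, float]:
    """Return the town with the largest students/population percentage.

    Towns with no registered non-distance students are excluded, matching
    the 'home to one or more university campuses' condition.
    """
    candidates = [t for t in towns
                  if t["non_distance_students"] > 0 and t["population"] > 0]
    best = max(candidates,
               key=lambda t: t["non_distance_students"] / t["population"])
    share = 100 * best["non_distance_students"] / best["population"]
    return best["town"], share


print(highest_student_share(TOWNS))
```

Each choice in the code (which students count, which population is the denominator) is a piece of context that the short version of the question leaves unstated.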
[Diagram: re/use of data]
Render. Technology:
• Hardware
• Formats
• Software
Explain. Documentation:
• Standards
• Meaning
• Interpretation
Technology to render data
Documentation to explain
[Diagram: data, knowledge, value]
What is missing? Context
• Data is not self-describing
• Who provides the description?
• What does it cost to provide the description?
• How much of the description is held as tacit knowledge?
  • Expert’s personal knowledge
  • Rules and meaning encoded into the data and software
Data curation
• Data curation involves:
  • Data management
  • Adding value to data
  • Data sharing for re-use
  • Data preservation for later re-use
= open data
http://www.dcc.ac.uk/news/what-makes-data-curation
= data curation
Digital Curation Centre
DDI Alliance
Open data brings benefits and risks
[Diagram: open data, more users]
• Increases risk of poor analysis
• Highlights data curation failures
• Justifies data curation costs
• Expands expert community
• Pressure for more user support
Complementary ideas
• Actively curated data will:
  • Remain technologically accessible
  • Be easier to understand (and therefore use)
• Data curation will benefit from data being made more open:
  • Data that is in active use tends to remain usable
  • Widely used data is better understood than isolated data
Thank you
Hamish James
Manager, Information Management
[email protected]
04 931 4237