Long-term preservation of digital
geospatial data: challenges for ensuring
access and encouraging reuse
Anne Robertson, EDINA & Steve Morris, NCSU Libraries
EDINA National Data Centre
University of Edinburgh
North Carolina State University Libraries
NCGDAP
Architecture Working Group
OGC TC/PC Meeting
Bonn, 9th November 2005
Objectives
Why we’re here
•
Introduce preservation and access use cases
to OGC
•
Find points of intersection with OGC
initiatives
•
Flesh out research agenda for preservation of
geospatial digital data
•
“Permanent access and reuse” not just
preservation
North Carolina Preservation Partners
•
North Carolina State University Libraries
– University-wide GIS services since 1992
– New focus on publishing WMS services for use by
external clients or service aggregators
– Archiving local agency geospatial data since 2000
•
NC Center for Geographic Information & Analysis
– State government GIS agency
– Maintains state’s Corporate Geographic Database
– Coordinates many SDI initiatives, including NC OneMap
•
NC OneMap
– Seamless access to local, state, and federal data;
component part of National Map
– WMS services available individually from sources or
through aggregator viewer
– Focus on standards, best practices, data sharing
agreements, inventories, and metadata outreach
NC Geospatial Data Archiving Project
•
Cooperative project with Library of Congress
under the National Digital Information
Infrastructure and Preservation Program
(NDIIPP)
– One of 8 NDIIPP partnership projects, others focusing on
web pages, numeric data, video, business records, etc.
– Focus on developing a network of partners, identifying
preservation issues in various domain areas
•
NCGDAP: 3 year project focused on preservation
of state and local agency digital geospatial data
– Identify and acquire data
– Develop digital repository; ingest and manage content
•
Objective: engage existing spatial data
infrastructures in process of data preservation
NCGDAP Project Phases
•
Content Identification and Selection
– Work from existing inventory processes
– Select from among “early”, “middle”, and “late” stage
information products
•
Content Acquisition
– Acquire state and local agency content
– Investigate methods of automating archive development
•
Partnership Building
– Work within NC OneMap framework (infrastructure)
– Several other emerging geo-preservation projects
•
Content Retention and Transfer
– Metadata and ingest workflow
– Emphasis on repository-agnostic approach, avoid
“imprinting” one environment
– Initially using DSpace open source software, re-ingest
into a different environment later
Common Themes – Cartographic Representation
•
The counterpart to the
map is not just the dataset
but also models,
symbology, interpretation.
These key elements give
real meaning – how are
these captured for reuse?
Common Themes – GML for archiving?
•
Interest in alternative to proprietary vector file formats
•
“Permanent access” requirements:
– profiles and application schemas widely understood and supported,
avoid requiring “digital archaeology”
– Role of GML Simple Features Specification?
•
Assessing formats for preservation: sustainability factors,
quality & functionality factors
•
Planned environmental scan of existing GML profiles and
application schemas
– Collaboration with National Archives and Records Administration
and FGDC Historical Data Working Group
– Vendor support? Official status? Stability over time?
•
How to handle proprietary formats?
– UC Santa Barbara/Stanford NDIIPP project working on format
registry
– Spatial databases pose special challenges
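As a rough illustration of what a non-proprietary vector encoding can look like, the sketch below writes one point feature as GML-style XML using only the Python standard library. The feature type `CountyOffice`, the `ex` namespace URI, the `location` property and the coordinates are all invented for illustration, and the output is not validated against any GML profile or application schema.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"
EX_NS = "http://example.org/ncgdap-sketch"  # hypothetical application namespace

ET.register_namespace("gml", GML_NS)
ET.register_namespace("ex", EX_NS)


def point_feature_to_gml(name, lat, lon, srs="EPSG:4326"):
    """Return a GML-style XML string for a single named point feature."""
    # "CountyOffice" and "location" are hypothetical names, not from any schema.
    feature = ET.Element(f"{{{EX_NS}}}CountyOffice")
    ET.SubElement(feature, f"{{{GML_NS}}}name").text = name
    geom_prop = ET.SubElement(feature, f"{{{EX_NS}}}location")
    point = ET.SubElement(geom_prop, f"{{{GML_NS}}}Point", {"srsName": srs})
    # Coordinate order here is an assumption of this sketch (lat, then lon).
    ET.SubElement(point, f"{{{GML_NS}}}pos").text = f"{lat} {lon}"
    return ET.tostring(feature, encoding="unicode")


gml_text = point_feature_to_gml("Example office", 35.78, -78.64)
print(gml_text)
```

Because the result is plain, namespaced XML, it can be read back with any XML parser, which is the property the "avoid digital archaeology" requirement is after.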
Common Themes – Content replication
•
Need efficient means to replicate content to archive
– North Carolina: 100 counties and 140 municipalities
•
Content replication also needed for:
– Disaster preparedness
– State and federal data improvement projects
– Aggregation by regional geospatial web service providers
•
WFS, e.g.: efficiency in complete content transfer?
•
Rsync-like function, plus: rights management, inventory
processes, metadata management, informed by data
update cycles
•
Archiving delta files vs. complete replication – need to
avoid requiring “digital archaeology” in the future
•
Other models: LOCKSS (Lots of Copies Keeps Stuff Safe)
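The "rsync-like function" idea can be sketched as a comparison of checksum manifests: one built at the data source, one at the archive. This is a minimal illustration only, not the NCGDAP design. The function names and the file names in the example are hypothetical, and rights management, inventories and metadata handling are left out.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def plan_transfer(source_manifest, archive_manifest):
    """Compare two manifests and list what a replication run would touch."""
    source_names = set(source_manifest)
    archive_names = set(archive_manifest)
    changed = sorted(
        name
        for name in source_names & archive_names
        if source_manifest[name] != archive_manifest[name]
    )
    return {
        "new": sorted(source_names - archive_names),
        "changed": changed,
        "removed_at_source": sorted(archive_names - source_names),
    }


# Hypothetical manifests; real ones would come from build_manifest().
source = {"parcels.shp": "aaa", "roads.shp": "bbb", "zoning.shp": "ccc"}
archive = {"parcels.shp": "aaa", "roads.shp": "old", "streams.shp": "ddd"}
plan = plan_transfer(source, archive)
print(plan)
```

Only the "new" and "changed" files would need to cross the network. Whether each changed file is then stored as a delta or as a complete copy is the policy question raised on this slide; files listed under "removed_at_source" would presumably be retained by an archive rather than deleted.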
Common Themes – Time versioning
•
How to manage datasets that change over time?
– Versions will live in different repositories, must handle
relationships outside of the individual repository
•
Industry focus on most current data … but increased
demand for temporal data
– e.g., land use change detection, business trends analysis
– Much older data lost -- “Digital dark age”
•
Draft NCGDAP approach: manage information for “serial
objects” separately, link to serial entity via persistent
identifier (Handle)
– Support “get current data/metadata/DRM” operations
– Avoid managing volatile information (e.g., service connections) in
individual static metadata records
– Other technologies: OpenURL for service connections?
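One way to picture the "serial object" idea is a small registry that sits outside any single repository: each dataset series has one persistent identifier, and dated snapshots held in different repositories are linked to it. The class names, the repository labels and the handle value below are hypothetical; this is a sketch of the concept, not the draft NCGDAP implementation.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Snapshot:
    """One archived version of a dataset, held in some repository."""
    captured: date
    repository: str  # hypothetical repository label
    location: str    # hypothetical item identifier within that repository


@dataclass
class SerialObject:
    """A dataset series identified by one persistent identifier."""
    handle: str
    title: str
    snapshots: list = field(default_factory=list)

    def add_snapshot(self, snapshot):
        """Register a snapshot and keep the list in date order."""
        self.snapshots.append(snapshot)
        self.snapshots.sort(key=lambda s: s.captured)

    def current(self):
        """Return the most recent snapshot ("get current data")."""
        if not self.snapshots:
            raise LookupError(f"no snapshots registered for {self.handle}")
        return self.snapshots[-1]

    def as_of(self, when):
        """Return the latest snapshot captured on or before the given date."""
        candidates = [s for s in self.snapshots if s.captured <= when]
        if not candidates:
            raise LookupError(f"no snapshot of {self.handle} on or before {when}")
        return candidates[-1]


# Hypothetical series with two snapshots living in different repositories.
parcels = SerialObject(handle="example/parcels", title="County parcels (example)")
parcels.add_snapshot(Snapshot(date(2005, 1, 15), "repo-b", "item-17"))
parcels.add_snapshot(Snapshot(date(2003, 6, 1), "repo-a", "item-02"))
print(parcels.current().location)
print(parcels.as_of(date(2004, 1, 1)).location)
```

Keeping the volatile part (which snapshot is current, where it lives) in the registry means the static metadata record stored with each snapshot never has to be edited when a new version arrives.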
EDINA
•
A National Data Centre for Tertiary Education since
1995
– based at the University of Edinburgh Data Library
•
Our mission...
to enhance the productivity of research, learning and
teaching in UK higher and further education
•
GeoServices team - provide SDI components to UK
academic sector
•
Substantial experience in handling and delivering
key geospatial data and geo-referenced
information
•
OGC members since 1999
•
Strategic move toward interoperability & shared
services role – use of OGC interface specifications
in our projects and services
GRADE project introduction
According to OECD Follow up Group on Issues of Access to Publicly
Funded Research Data¹ …
“More widespread and efficient access to and sharing of
research data will have substantial benefits for most
areas of scientific research.”
Evidence suggests that re-use of data within UK data centres is low:
– “Level of re-use of data held in the AHDS and ESRC
archives has been disappointingly low” (Alison Allden,
2003)
– “NERC spends about £5 million per annum on data
management, but unclear what benefit it derives from this.
More research is needed to establish benefits and value of
data re-use” (Mark Thorley, 2003)
– Qualidata survey of qualitative data re-use (2000): 44% of
respondents used a colleague's data rather than acquiring
archived data via a dissemination service (33%)
¹ Interim Report, 20 October 2002
GRADE project introduction
•
Within UK academia there is a focus on the potential use of
digital repositories to assist with a variety of facets of digital
asset management, including encouraging reuse of research
data
•
GRADE will investigate and report on the technical and cultural
issues around the reuse of geospatial data within the context
of discipline-based repositories
•
Particular focus on sharing and reuse of derived
geospatial data
•
EDINA leading GRADE with consortium partners:
– AHRC Research Centre for Studies in Intellectual Property and
Technology Law, School of Law, Edinburgh University
– National Oceanography Centre, Southampton University
– Variety of other associate partners including NCGDAP, British
Atmospheric Data Centre, Ordnance Survey
Common Themes – Digital Rights
•
UK environment is a complex one
– a dominant provider of base vector geospatial data
– array of space-borne survey data available, much free for non-commercial use
– Stakeholder interest from research funders (research councils) and
research hosts (institutions)
•
When we consider the reuse of derived geospatial data,
concerns over data ownership, IPR and copyright often
suppress any initial enthusiasm
•
We can offer the geoDRM discussion real scenarios of
– IPR issues for derived geospatial data and
– Geospatial data reuse/sharing use cases
Derived Data Example
[Figure: workflow diagram of a derived data example; labels as extracted]
Input: 2001 orthophotos; historic OS maps; OS Landline; ground survey
Processing: scan; georeference; GPS survey; accuracy assessment;
planimetric correction; digitise coastline positions; calculation of
cliff retreat
Output: ESRI Shapefile and tables of retreat
Source: Use case provision of derived geospatial data as part of the
GRADE project in scoping digital repositories (draft report)
Common Themes – Content Packaging
•
Consider a geospatial data asset deposited into a
repository: it is more than one file
– GML and associated schema!
– proprietary vector format plus cartographic representation detail
– geodatabase
– raster with header file
– Data set metadata and IPR info
•
What is best method to package data?
•
In the eLibrary world, the Metadata Encoding and
Transmission Standard (METS), IMS content package
(IMS CP) and MPEG-21 DIDL are used for repository objects
•
“Interoperable repositories need to encode, exchange
and describe complex objects in agreed ways”
•
What direction is the GI industry taking with content
packaging?
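To make the packaging question concrete, here is a minimal sketch of a METS-style manifest that lists the files making up one deposited asset, written with the Python standard library. The file names, IDs and label are hypothetical, descriptive metadata and rights sections are omitted, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_package_manifest(label, files):
    """Build a METS-style XML string listing the files of one asset.

    `files` is a list of (file_id, relative_path, mime_type) tuples.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets", {"LABEL": label})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "original"})
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"LABEL": label})
    for file_id, path, mime_type in files:
        file_el = ET.SubElement(
            file_grp, f"{{{METS_NS}}}file", {"ID": file_id, "MIMETYPE": mime_type}
        )
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path},
        )
        # The structural map points back at each file by its ID.
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return ET.tostring(mets, encoding="unicode")


# Hypothetical asset: a shapefile component, its attribute table, and metadata.
manifest_xml = build_package_manifest(
    "Coastline positions (example)",
    [
        ("F1", "coast.shp", "application/octet-stream"),
        ("F2", "coast.dbf", "application/octet-stream"),
        ("F3", "metadata.xml", "text/xml"),
    ],
)
print(manifest_xml)
```

The point of the sketch is only that a multi-file asset travels with one explicit inventory; IMS CP or MPEG-21 DIDL would express the same inventory in different vocabularies.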
Common Themes – Persistent Identifiers
•
Once a geospatial data asset is deposited within a
repository, there is a need to be able to persistently
identify this asset
•
Particular repository software packages use particular schemes,
e.g. Fedora uses the ‘info’ URI scheme
•
Requirement to ensure identifier is actionable
•
We are thinking about OpenURL Resolvers and perhaps
Digital Object Identifier (DOI) for handle schemes
•
What direction is GI industry taking with persistent
identifiers?
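"Actionable" here means that an identifier can be turned into something a client can follow. The sketch below maps a `scheme:value` identifier to a resolver URL using the public Handle and DOI proxy addresses. The identifier strings in the example are hypothetical, and the `scheme:value` input convention is an assumption of this sketch rather than a format taken from either project.

```python
RESOLVERS = {
    "hdl": "https://hdl.handle.net/",
    "doi": "https://doi.org/",
}


def to_actionable_url(identifier):
    """Turn 'scheme:value' into a resolver URL for the schemes in RESOLVERS."""
    scheme, sep, value = identifier.partition(":")
    if not sep or not value:
        raise ValueError(f"expected 'scheme:value', got {identifier!r}")
    try:
        base = RESOLVERS[scheme.lower()]
    except KeyError:
        # An identifier can be persistent without being actionable.
        raise ValueError(f"no resolver known for scheme {scheme!r}") from None
    return base + value


# Hypothetical identifier for a deposited dataset.
url = to_actionable_url("hdl:1234.5/example-dataset")
print(url)
```

An identifier in a scheme with no entry in `RESOLVERS` (for example an `info:` URI) raises `ValueError` here, which illustrates the gap between identifying an asset and being able to retrieve it.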
Common Themes – ‘data plus services’ model
National Library of New Zealand
http://wiki.tertiary.govt.nz/static/wikifarm/InstitutionalRepositories.uploads/Main/IR_report.pdf
Conclusions
•
Aim is to flesh out research agenda
•
Presented 7 common themes from our work
•
Shift to web services consumption poses threat
to secondary archive development … but can
geospatial web services be put to use in
preservation processes?
•
Encourage GI community to connect with these
issues or outcome may be that archive
community will fail to take account of OGC work
•
Where to from here?
Contact details
Anne Robertson
GRADE Project Manager
EDINA National Data Centre
[email protected]
GRADE web site: http://edina.ac.uk/projects/grade
Steve Morris
Head of Digital Library Initiatives
North Carolina State University Libraries
[email protected]
NCGDAP web site: http://www.lib.ncsu.edu/ncgdap/
Questions?