Introduction to SeaDataNet Metadata

SeaDataNet Training Course
Introduction to SeaDataNet
Metadata
Roy Lowry
British Oceanographic Data Centre
Overview
• An introduction to the
SeaDataNet metadata formats
covering
Purpose
Entity definition
History
Population
Strengths
Weaknesses
Overview
• SeaDataNet metadata formats
 European Directory of Marine Organisations
(EDMO)
 Cruise Summary Report (formerly ROSCOP)
 European Directory of Marine Environmental
Datasets (EDMED)
 European Directory of the Ocean Observing
System (EDIOS)
 SeaDataNet Common Data Index (CDI)
 European Directory of Marine Environmental
Research Projects (EDMERP)
EDMO
• Purpose
 Provides SeaDataNet with an address book of
organisations associated with marine data
 Provides descriptions of these organisations
• Entity definition
 Any group of people sharing a common postal
address engaged in activities associated with marine
data acquisition and use
• History
 Developed by Maris during SEA-SEARCH in
response to a need to improve address metadata
management across the project
EDMO
• Population
 On-line Content Management System
fronted by a web form (http://www.seasearch.net/organisations/)
 Partners are responsible for maintenance of
their national record set
 Management supported by a reasonably
sophisticated access control system that
authenticates users and grants access to
the appropriate database subset
EDMO
• Strengths
 The maintenance tool. Please use it to look after the
entries for your country
 Provides a single point of entry for SeaDataNet
metadata documents associated with a given
organisation
 Centralisation of metadata common to other
catalogues, replacing four independently maintained
address metadata repositories
 Rich information content, including descriptions,
logos and spatial location information
EDMO
• Weaknesses
 Simple data model is poorly equipped for the
management of organisational evolution
 Organisations merge, fragment, rename and move
 All we can do in EDMO is document this using plain
language fields
 Text fields contain embedded markup
 These look very nice when displayed through the
search interface
 However, the markup causes problems generating
XML documents for record transport between systems
 Examples including graphics and relative URLs break
when transported by copy/paste
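The markup problem can be sketched as follows. This is a minimal illustration, not the EDMO export code: the description text and the element names `organisation` and `description` are hypothetical. It assumes only Python's standard `xml.etree.ElementTree`, which escapes markup characters when text is assigned to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical description field with embedded HTML (not a real EDMO entry).
description = 'Marine data centre<br><img src="../logos/centre.gif">'

# Naive templating: the unclosed <br> and <img> tags make the XML ill-formed.
naive_xml = (
    "<organisation><description>"
    + description
    + "</description></organisation>"
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as an XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Building the tree instead escapes <, > and & in the text content.
org = ET.Element("organisation")
ET.SubElement(org, "description").text = description
safe_xml = ET.tostring(org, encoding="unicode")

print(is_well_formed(naive_xml))
print(is_well_formed(safe_xml))
```

Escaping keeps the document well formed and the original text recoverable, but it does not repair a relative URL: the image path above still only resolves on the site the text came from.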
CSR
• Purpose
 To document the operational and data generation activities
of an oceanographic research cruise
• Entity definition
 A subject of some controversy
 I am a metadata purist and support the definition of a
‘cruise’ as the interval of time between leaving port and
returning to port
 Thus for a 3-leg cruise I would generate 3 CSR records
whilst others would generate just one. I do this because:
 Combining records is easier than splitting them
 Cruise ‘legs’ for some ships can be VERY different (e.g. 3
legs of a Meteor cruise: one JGOFS, one OMEX, one
WOCE)
 Merging ‘legs’ is a slippery slope – I’ve even encountered a
single record covering the activities of two ships three
months apart
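The claim that combining records is easier than splitting them can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `LegRecord` fields are not CSR fields, and the ship name and dates are invented. Only the three project names come from the example above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LegRecord:
    """Hypothetical per-leg summary; field names are illustrative only."""
    ship: str
    sailed: date
    docked: date
    projects: frozenset


def combine(legs):
    """Merge per-leg records for one ship into a single cruise-level record."""
    ships = {leg.ship for leg in legs}
    if len(ships) != 1:
        raise ValueError("refusing to merge records from different ships")
    return LegRecord(
        ship=ships.pop(),
        sailed=min(leg.sailed for leg in legs),
        docked=max(leg.docked for leg in legs),
        projects=frozenset().union(*(leg.projects for leg in legs)),
    )


# Invented ship and dates; one record per port-to-port leg.
legs = [
    LegRecord("RV Example", date(2000, 1, 1), date(2000, 1, 20), frozenset({"JGOFS"})),
    LegRecord("RV Example", date(2000, 1, 22), date(2000, 2, 10), frozenset({"OMEX"})),
    LegRecord("RV Example", date(2000, 2, 12), date(2000, 3, 1), frozenset({"WOCE"})),
]

cruise = combine(legs)
print(cruise.sailed, cruise.docked, sorted(cruise.projects))
```

The merge is a single pass over the legs. The reverse is not available from the merged record alone, because it no longer says which project sailed on which leg or when the intermediate port calls happened.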
CSR
• Entity definition (continued)
 Problem with my definition is that the real world
creates grey areas. For example, does a personnel
change by pilot boat in an estuary count as
‘docking’?
 Others extend the definition to cover any activity
collecting oceanographic data (shoehorning)
 I believe this is a very bad thing to do
 The activity super-class and other activity sub-classes
are much better described by other metadata
standards (e.g. in OGC Observations and
Measurements)
 Later on in SeaDataNet we could consider
incorporating some of these to further enrich our
metadata portfolio
 In the meantime remember that it is NOT necessary
to have every measurement covered by a CSR. If it
isn’t appropriate, don’t create one.
CSR
• History
 Originally a paper form developed by IOC called a
ROSCOP
 Replaced in 1990 by the Cruise Summary Report with
richer content (but the name ROSCOP stuck)
 Numerous on-line databases developed during the
1990s
 Primary repositories now DOD for SeaDataNet
partners and ICES for non-SeaDataNet
CSR
• Population
 On-line web-form (http://www.seasearch.net/roscop/welcome.html)
 XML schema available for bulk transfers
• Strengths
 Flexible population mechanisms
 Long history with a massive legacy
population
 Cruise is (or should be) a well defined
concept to oceanographers
CSR
• Weaknesses
 “Parameter” vocabulary
 Really a vocabulary describing shipborne activities
 No clear equivalent elsewhere for interoperability, but
ontological mapping to multiple vocabularies might provide
a solution
 On-line systems developed using plaintext fields when
controlled vocabularies would have made interoperability
between repositories more straightforward
 Spatial coverage limitations
 Coarse-grained
 Described using Marsden Squares but BODC has deployed
a Web Service to convert these to ISO19115/DIF standard
bounding boxes
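The coarse-grained coverage can be sketched as below. This is not the BODC Web Service: it assumes each Marsden Square has already been resolved to the south-west corner of its 10-degree cell (the square numbering itself is not reproduced here), and the function name `bounding_box` is hypothetical. It also ignores coverage that crosses the 180-degree meridian.

```python
def bounding_box(sw_corners, cell_size=10):
    """Return (west, south, east, north) enclosing the given cells.

    sw_corners is an iterable of (latitude, longitude) pairs giving the
    south-west corner of each 10-degree cell, in decimal degrees.
    """
    corners = list(sw_corners)
    if not corners:
        raise ValueError("at least one cell is required")
    souths = [lat for lat, _ in corners]
    wests = [lon for _, lon in corners]
    return (
        min(wests),
        min(souths),
        max(wests) + cell_size,
        max(souths) + cell_size,
    )


# Three cells forming an L shape in the north-east Atlantic.
cells = [(40, -20), (50, -20), (50, -10)]
print(bounding_box(cells))  # (-20, 40, 0, 60)
```

The resulting box also covers the cell at 40N, 10W that the cruise never entered, which shows how coarse the spatial description becomes once squares are collapsed into a single box.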
EDMED
• Purpose
 To describe marine environmental datasets to promote
their discovery
• Entity definition
 A dataset, but what is a dataset?
 ISO19101 defines a dataset as ‘an identifiable collection of
data’ which covers everything from the parameters
measured on a single water sample to the 7,500,000 CTDs
in the USNODC World Ocean Database
 Sound judgement is needed to decide upon appropriate
granularity
 Best approach is to establish objective criteria
 Worth remembering that a measurement may be included in
more than one dataset
 Posing this question to metadata specialists can provide
good sport!
EDMED
• History
 Developed by BODC in the late 1980s
 Adopted by EU MAST Data Committee, then SEA-SEARCH and now SeaDataNet
• Population
 Form interface to stand-alone Access database that
is submitted to BODC for ingestion
 XML schema available for bulk transfers
• Strengths
 Content quality controlled on ingestion, therefore
standards are high
 Rich content developed during SEA-SEARCH
EDMED
• Weaknesses
 Developed in splendid isolation,
including vocabularies, therefore
interoperability with other systems is
difficult
 Heavy dependence on plaintext fields:
a problem that should be addressed
during SeaDataNet
EDIOS
• Purpose
 To describe marine environmental datasets
comprising data that are collected repeatedly,
regularly and routinely in order to promote their
discovery (initially for operational planning purposes)
• Entity definition
 A dataset comprised of data that are collected
repeatedly, regularly and routinely, but what is a
dataset (cf. EDMED)?
• History
 Developed as an EU project led by EuroGOOS
 Inherited by SeaDataNet
EDIOS
• Population
 Currently an issue
 There is a Word-based form (the MIF)
– Developed in parallel to the data model and
database with no evidence of communication
– Completed MIFs entered into the database at
BODC, requiring significant interpretation and
information rehashing (long and painful process)
 SeaDataNet work in progress
– IFREMER/BODC working to produce an XML
schema to facilitate large-scale transfer
– Maris/BODC developing a web-form based
content management system along the lines of
EDMO
EDIOS
• Strengths
 Rich data model based on structured fields
with minimal plaintext
 Data model includes hierarchical
relationships between entities (project one-to-many
observing programmes one-to-many measurement series)
 Data model includes support for complex
spatial objects (polygons not boxes)
 Data model is particularly well suited to the
description of operational oceanographic
systems
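The hierarchy and the polygon support described above might be modelled as follows. This is a sketch under stated assumptions, not the EDIOS relational schema: the class and field names are hypothetical, and the example project, programme and series are invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MeasurementSeries:
    name: str
    # Polygon footprint as (longitude, latitude) vertices, not a bounding box.
    footprint: List[Tuple[float, float]]


@dataclass
class ObservingProgramme:
    name: str
    series: List[MeasurementSeries] = field(default_factory=list)


@dataclass
class Project:
    name: str
    programmes: List[ObservingProgramme] = field(default_factory=list)


def all_series(project: Project) -> List[MeasurementSeries]:
    """Walk project -> programmes -> series and return every series."""
    return [s for programme in project.programmes for s in programme.series]


# Invented content, for illustration only.
project = Project(
    "Example coastal observatory",
    [
        ObservingProgramme(
            "Tide gauges",
            [
                MeasurementSeries("Gauge A", [(-3.0, 53.0), (-2.9, 53.0), (-2.9, 53.1)]),
                MeasurementSeries("Gauge B", [(-4.0, 54.0), (-3.9, 54.0), (-3.9, 54.1)]),
            ],
        ),
        ObservingProgramme(
            "Ferry transects",
            [MeasurementSeries("Route 1", [(-3.0, 53.4), (-6.0, 53.3), (-6.0, 53.5)])],
        ),
    ],
)

print([s.name for s in all_series(project)])
```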
EDIOS
• Weaknesses
 At the start of SeaDataNet EDIOS had
17 local vocabularies
 Extremely poor content governance
 Undergoing replacement with
managed SeaDataNet standard
vocabularies (6 down, 11 to go)
 Legacy content has not been
systematically quality controlled
EDIOS
• How is EDIOS different from EDMED?
 Both are content standards designed to
describe datasets
 Any dataset described by an EDMED
document could be described by an EDIOS
document and vice versa
 Once vocabularies have been harmonised
and some mappings set up it should be
possible to generate an EDMED document
from an EDIOS document
 Generation of an EDIOS document from an
EDMED document will never be possible
EDIOS
• How is EDIOS different from EDMED?
 SeaDataNet convention is to use EDIOS for
‘qualifying’ datasets and EDMED for everything else
 EDMED currently has a working population
mechanism, but EDIOS does not
 Advice to partners
 Identify datasets to be described by EDIOS
documents, map them to the EDIOS data model
(relational schema and Access prototype on BSCW)
and gather together the necessary information
 Prepare EDMED documents for all other datasets
and get them into BODC
 Submit EDIOS entries to BODC once the necessary
systems are operational
CDI
• Purpose
 To provide an ultra-light discovery metadata
description of accessible SeaDataNet data objects
 Used to build a manageable fine-grained index of
discrete data objects (millions of entries)
• Entity definition
 The fundamental SeaDataNet data delivery unit such
as a current meter record or a CTD profile
• History
 Developed by SEA-SEARCH as a pilot for SeaDataNet
CDI
• Population
 XML schema describing files that should be
generated automatically from existing digital
indexes
• Strengths
 Light content makes efficient handling of
large numbers of records possible
• Weaknesses
 Light content restricts available information
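Automatic generation from an existing digital index could look something like the sketch below. It does not follow the CDI XML schema: the element names, the CSV columns and the two index rows are all invented for illustration, and a real export would need to conform to the published schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical index rows; columns and values are invented for illustration.
INDEX_CSV = """id,type,latitude,longitude,start
0001,CTD profile,50.0,-10.0,2006-01-01
0002,current meter record,51.5,-8.25,2006-02-01
"""


def index_to_xml(index_text: str) -> ET.Element:
    """Turn each row of a CSV index into one lightweight XML record."""
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(index_text)):
        record = ET.SubElement(root, "record", id=row["id"])
        for column in ("type", "latitude", "longitude", "start"):
            ET.SubElement(record, column).text = row[column]
    return root


root = index_to_xml(INDEX_CSV)
print(ET.tostring(root, encoding="unicode"))
```

Because every record is produced by the same loop from the same index, the approach scales to very large numbers of entries without hand editing, at the cost of carrying only what the index already holds.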
EDMERP
• Purpose
 Description of European marine research
projects and programmes
• Entity definition
 A co-ordinated collection of marine data
acquisition activities in Europe
• History
 Developed by Maris during SEA-SEARCH
EDMERP
• Population
 Access form: resulting mdb file submitted to
Maris
 On-line content management system
planned
• Strengths
 Provides centralised project metadata
• Weaknesses
 Local vocabularies and plaintext
That’s All Folks!
Questions or
Geoff?