SeaDataNet Training Course, Ostend, June 16-19, 2008
Data Formats, Flags
and Vocabularies
Roy Lowry
British Oceanographic Data Centre
File Formats
Available formats
Format Selection Criteria
Types of Data
Delivery Use Case Issues
SeaDataNet Profiling
Objectives
SeaDataNet Profiling Details
Available Formats
Three format profiles are being
developed for SeaDataNet data transfers
SeaDataNet ODV Profile
Simple ASCII format based on a
spreadsheet model
SeaDataNet MEDATLAS Profile
Minor variation on an established ASCII
format
SeaDataNet CF NetCDF Profile
Binary data conforming to API and content
model based on an established community
standard (CF)
Format Selection Criteria
The $64,000 question is “What
format should I use for my
data?”
The answer depends on the
type of data and on the data
delivery use case
Types of Data
Think of data in terms of ‘feature types’
Profiles (x, y, t effectively fixed: z varies)
Bottle casts, CTDs, XBTs, radiosondes, core profiles
Point series (x, y, z effectively fixed: t varies)
Current meters, wave statistics, sea level, wind
velocity
Trajectories (x, y, z (sometimes), t all vary)
Underway data (TSG, bathymetry, meteorology),
undulator data, airborne measurements
Grids (two or more of x, y, z, t vary systematically)
Satellite data, model output, synthesised data
products
Types of Data
Most of our data may be modelled in
terms of these feature types
For example:
CTD data
– Modelled well by the ‘profile’ type
Recording current meter data
– Modelled well by the ‘point series’ type
Moored ADCP
– Modelled poorly by ‘point series’ type
(needs to be considered as one point
series per depth bin)
– But is modelled well by ‘grid’ with z, t
varying and x, y fixed
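As a rough illustration of the feature-type idea, the sketch below classifies a dataset by which co-ordinates vary. The function name, its arguments and the handling of unclear cases are assumptions made for this example only; they are not part of any SeaDataNet specification.

```python
def feature_type(varying, systematic=False):
    """Guess a feature type from the set of co-ordinates that vary.

    `varying` is a collection drawn from {"x", "y", "z", "t"}.
    `systematic` is True when two or more co-ordinates vary on a regular
    lattice (grid-like data). Both arguments are hypothetical conventions.
    """
    varying = set(varying)
    if systematic and len(varying) >= 2:
        return "grid"
    if varying == {"z"}:
        return "profile"
    if varying == {"t"}:
        return "point series"
    if {"x", "y", "t"} <= varying:
        return "trajectory"
    return "unclassified"
```

With these conventions a CTD cast is `feature_type({"z"})`, and a moored ADCP is `feature_type({"z", "t"}, systematic=True)`, which comes out as a grid rather than a point series.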
Delivery Use Case Issues
Data exchange between consenting
Mediterranean partners
Data provider holds data in
MEDATLAS format
Data recipient wants data in
MEDATLAS format
Could be addressed using Nemo
software to convert MEDATLAS to
ODV profile
Delivery Use Case Issues
Problems with this approach
Recipient needs to do unnecessary work converting
ODV to MEDATLAS
Risk of information loss in the conversion process
MEDATLAS is used by a significant proportion
of the SeaDataNet community
Consequently, the transaction system
development overhead to support exchange in
MEDATLAS format was considered worthwhile
Format Recommendations
Mandatory formats
Use ODV for
Profiles
Point series
Trajectories (including underway
ADCP)
Use NetCDF for
Grids
Data that don’t fit comfortably into
ODV due to shape or volume
Data for use with NetCDF-enabled
tools
Format Recommendations
Optional format
Use MEDATLAS for
Whatever you use MEDATLAS
for at the moment
SeaDataNet Profiling Objectives
Two objectives
Providing linkage between data and
SeaDataNet metadata (CDI record)
Standardising semantics
Consistent labelling of parameters
– Use terms from a controlled vocabulary
(more on this later)
Consistent labelling of storage units
– Use terms from a controlled vocabulary
– Parameter definition DOES NOT dictate
storage unit
SeaDataNet ODV Profile
Described in BSCW document (Word)
https://www.ifremer.fr/bscw/bscw.cgi/d93460/Specification%20of%20SeaDataNet%20Data%20Transport%20Formats
Examples of profile, point series and
trajectory data (Excel)
https://www.ifremer.fr/bscw/bscw.cgi/d93465/Examples%20of%20SeaDataNet%20variant%20ODV%20spreadsheet-based%20import%20format
SeaDataNet ODV Profile
ODV format based on a spreadsheet
model with three types of row
Comment row
One cell with text starting with //
Column header row
Data row
Column header and data rows have
three types of column
Metadata columns
Primary variable data columns (value + flag)
Data columns (value + flag pairs)
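The three row types can be told apart with very little logic. The sketch below is one possible reader, assuming tab-separated fields (as the format requires) and that the first non-comment row is the column header row. The function names are invented for this example.

```python
def classify_row(line, header_seen):
    """Return 'comment', 'header' or 'data' for one line of an ODV text file."""
    if line.startswith("//"):
        return "comment"
    if not header_seen:
        # Assumption: the first non-comment row is the column header row.
        return "header"
    return "data"


def read_odv(lines):
    """Split ODV lines into comment rows, column names and data rows."""
    comments, header, rows = [], None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        kind = classify_row(line, header is not None)
        if kind == "comment":
            comments.append(line)
        elif kind == "header":
            header = line.split("\t")
        else:
            rows.append(line.split("\t"))
    return comments, header, rows
```

It can be fed an open file directly, for example `read_odv(open("cast.txt", encoding="utf-8"))`, where the file name is a placeholder.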
SeaDataNet ODV Profile
SeaDataNet profile extensions
CDI linkage
Addition of two extra metadata columns
(LOCAL_CDI_ID and EDMO_code)
Semantic mapping
Structured comment records immediately
preceding the ODV column header record
First record is ‘//SDN_parameter_mapping’
Followed by one mapping record for each
data column in the file
SeaDataNet ODV Profile
Mapping record example
//<subject>SDN:LOCAL:Depth</subject><object>SDN:P011::ADEPZZ01</object><units>SDN:P061::ULAA</units>
– Subject element is the column heading text
excluding ODV units field (e.g. ‘Depth’ for ‘Depth
[m]’)
– Object element is the SeaDataNet URN for the
parameter (SDN:P011::ADEPZZ01)
– Units element is the SeaDataNet URN for the
data storage units (SDN:P061::ULAA)
More about URNs and what we can do with them
later…..
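A mapping record is regular enough to be picked apart with a regular expression. The sketch below assumes the three elements always appear in the order subject, object, units, as in the example above. The helper names are made up for illustration.

```python
import re

# Assumption: elements always appear in the order subject, object, units.
MAPPING_RE = re.compile(
    r"^//\s*<subject>(?P<subject>.*?)</subject>"
    r"\s*<object>(?P<object>.*?)</object>"
    r"\s*<units>(?P<units>.*?)</units>\s*$"
)


def parse_mapping_record(line):
    """Return (column_label, parameter_urn, units_urn), or None if no match."""
    match = MAPPING_RE.match(line.strip())
    if match is None:
        return None
    # The subject is written as SDN:LOCAL:<column label>; keep only the label.
    label = match.group("subject").split(":", 2)[-1]
    return label, match.group("object"), match.group("units")


def column_label(heading):
    """Strip a trailing ODV units field, e.g. 'Depth [m]' becomes 'Depth'."""
    return re.sub(r"\s*\[[^\]]*\]\s*$", "", heading)
```

Comparing `column_label()` of each data column heading with the labels returned by `parse_mapping_record()` gives a quick check that every data column has a mapping record.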
SeaDataNet ODV Profile
SeaDataNet Metadata and Primary Variables
Profile data
Metadata (x,y,t) set to nominal profile position and
time (same for every data value)
Primary variable is the z co-ordinate (depth in metres
or pressure in decibars)
Point series data
Metadata (x,y,t) set to the measurement location and
series start time (same for every data value)
Primary variable is the t co-ordinate (Chronological
Julian Day - days elapsed since 00:00 on January 1
4713 BC)
Trajectory data
Metadata (x,y,t) set to measurement time and position
Primary variable is the z co-ordinate (depth in metres
or pressure in decibars)
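Chronological Julian Day values are easy to convert to calendar dates by offsetting from a known day. The sketch below assumes the 4713 BC epoch is taken in the proleptic Julian calendar (the usual Julian Date convention) and that no time-zone offset is applied. Under those assumptions 00:00 on 1 January 1970 is day 2440588.0.

```python
from datetime import datetime, timedelta

# Chronological Julian Day of 1970-01-01 00:00
# (astronomical Julian Date 2440587.5 plus the half-day shift to midnight).
CJD_1970 = 2440588.0
EPOCH_1970 = datetime(1970, 1, 1)


def cjd_to_datetime(cjd):
    """Convert a Chronological Julian Day number to a naive datetime."""
    return EPOCH_1970 + timedelta(days=cjd - CJD_1970)


def datetime_to_cjd(moment):
    """Convert a naive datetime to a Chronological Julian Day number."""
    return CJD_1970 + (moment - EPOCH_1970) / timedelta(days=1)
```

For times far from 1970 the fractional day is limited by floating-point precision, so sub-second timing should not be relied on.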
SeaDataNet ODV Profile
Watchpoints
File extension should be .txt
Field separator is the tab character (not
semi-colon)
Physical file mapping
The format is capable of holding multiple SeaDataNet
data objects in a single physical file
The SeaDataNet 1 system CANNOT support this
Means aggregation and splitting tools (or a lot of
patience!) will be required (hardly rocket science)
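A splitting tool can be very small. The sketch below groups already-parsed data rows by the LOCAL_CDI_ID and EDMO_code columns and renders each group as its own tab-separated file body. It assumes both columns are filled in on every data row; the function names are hypothetical.

```python
def split_by_cdi(header, rows):
    """Group data rows by (LOCAL_CDI_ID, EDMO_code), keeping file order."""
    id_col = header.index("LOCAL_CDI_ID")
    edmo_col = header.index("EDMO_code")
    groups = {}
    for row in rows:
        # Assumption: both metadata columns are populated on every row.
        groups.setdefault((row[id_col], row[edmo_col]), []).append(row)
    return groups


def render_object(comments, header, rows):
    """Build the text of a single-object file: comments, header, data rows."""
    lines = list(comments)
    lines.append("\t".join(header))
    lines.extend("\t".join(row) for row in rows)
    return "\n".join(lines) + "\n"
```

Each rendered string can then be written to its own .txt file, one per SeaDataNet data object.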
SeaDataNet MEDATLAS Profile
Those who want to use MEDATLAS
know it better than me, so I’m not going
to try and teach the format!
The most important SeaDataNet
extension is the link to CDI records,
which is done by a pair of structured
comment records for each SeaDataNet
object thus:
*EDMO_CODE = EDMO identifier of the data
centre managing the CDI
*LOCAL_CDI_ID = local identifier of the station
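Reading the CDI link back out of a file is a matter of spotting these two comment records. The sketch below assumes each record sits on its own line in the form `*NAME = value`, with optional spaces around the equals sign; that layout is inferred from the two lines above rather than taken from the MEDATLAS specification.

```python
import re

# Assumption: one record per line, written as "*NAME = value".
LINK_RE = re.compile(r"^\*(EDMO_CODE|LOCAL_CDI_ID)\s*=\s*(.*?)\s*$")


def read_cdi_link(lines):
    """Collect EDMO_CODE and LOCAL_CDI_ID values from comment records."""
    link = {}
    for line in lines:
        match = LINK_RE.match(line)
        if match:
            link[match.group(1)] = match.group(2)
    return link
```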
SeaDataNet MEDATLAS Profile
We can also add standardised
semantic mapping records as per
ODV such as:
*<subject>SDN:LOCAL:Temperature</subject><object>SDN:P0
11::TEMPS901</object><units>SDN:P061::UPAA</units>
However, once the mapping
between MEDATLAS parameter
codes and P011 is completed,
these become unnecessary
SeaDataNet CF NetCDF Profile
This is VERY immature, so currently
there is nothing to teach
ASCII formats should be sufficient for
most SeaDataNet 1 transactions
Further work during the next 6 months
Partners who feel they need NetCDF for
their data should contact the Technical
Task Team (Dick Schaap or Roy Lowry)
SeaDataNet Qualifying Flags
What is a Qualifying Flag?
SeaDataNet Flags
Conflict resolution
What is a Qualifying Flag?
Back in the mists of time (IODE in early 1980s?) it was
decreed that all data values should be accompanied by a
‘flag’ in the form of a 1-byte code
Built into many data format specifications (MEDATLAS,
BODC PXF/QXF, GF3…)
Initially thought of as a data quality label
However, it provides the only metadata ‘hook’ that is
unambiguously linked to a specific data value
Consequently, it has suffered information overload
carrying other information about non-quality issues
We cannot correct this without major re-engineering of
data held as files, which isn’t going to happen
SeaDataNet Flags
Information overloading has led to two
types of flag in SeaDataNet
Quality Flags
0 – quality unknown
1 – good value (looks good and no reported
problems)
2 – probably good value (associated with a
known malfunction but looks OK)
3 – probably bad value (associated with a
known malfunction but looks wrong)
4 – bad value (clearly wrong)
SeaDataNet Flags
Information overloading has led to two
types of flag in SeaDataNet
Information flags
5 – changed value (during quality control)
6 – below detection (true value < quoted
value)
7 – value in excess (true value > quoted
value)
8 – interpolated value (special case of a
changed value)
9 – missing value
A – phenomenon uncertain (e.g. question
over identification of biological specimen)
Conflict Resolution
We can now see the problems
caused by overloading
How can we tell the difference
between a ‘good changed value’ and a
‘bad changed value’?
The simple answer is that we can’t. We
can indicate that the value was changed
(flag 5), good (flag 1) or bad (flag 4), but only one of these
So we have to compromise…..
Conflict Resolution
How do we compromise?
By prioritising flag assignments
Initially, all flags are set to 0, 9, 7, 6 or A
(detection level and uncertainty information
comes from the originator, not QA)
Next we either interpolate or replace and flag
appropriately (8 or 5)
Finally we switch remaining zero flags to 1,
2, 3 or 4 as appropriate
This is not ideal and we need to do
better in SeaDataNet 2.
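The prioritisation can be written down as a three-step function. The sketch below is one reading of the steps above, not an official algorithm: it assumes an interpolation or replacement flag (step 2) overrides whatever was set in step 1, and that a quality verdict (step 3) is only applied when the flag is still 0. The function and argument names are invented.

```python
ORIGINATOR_FLAGS = {"0", "9", "7", "6", "A"}
EDIT_FLAGS = {"5", "8"}
QUALITY_VERDICTS = {"1", "2", "3", "4"}


def assign_flag(originator_flag="0", edit_flag=None, qc_verdict=None):
    """Resolve one value's flag using the three-step priority."""
    if originator_flag not in ORIGINATOR_FLAGS:
        raise ValueError("step 1 flag must be one of 0, 9, 7, 6 or A")
    flag = originator_flag

    # Step 2 (assumption: a changed or interpolated value replaces the
    # step 1 flag, e.g. a missing value that has been interpolated).
    if edit_flag is not None:
        if edit_flag not in EDIT_FLAGS:
            raise ValueError("step 2 flag must be 5 or 8")
        flag = edit_flag

    # Step 3: only flags that are still zero receive a quality verdict.
    if flag == "0" and qc_verdict is not None:
        if qc_verdict not in QUALITY_VERDICTS:
            raise ValueError("step 3 flag must be 1, 2, 3 or 4")
        flag = qc_verdict
    return flag
```

The compromise is visible in the result: `assign_flag("0", "5", "4")` returns "5", so the fact that the changed value was judged bad is lost.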
Vocabularies
What are vocabularies and
mappings?
Vocabularies for Metadata
Vocabularies for Data
Vocabulary Access
Vocabulary Maintenance
What is a Vocabulary?
A vocabulary is a list of standardised
terms used to populate a metadata field
The SeaDataNet vocabulary model
considers each such term to possess
A key (permanent, semantically neutral
identifier for the term, possibly a mnemonic)
A term (full human-readable label)
An abbreviation (short human-readable
label)
A definition (full explanation of the term’s
meaning)
What is a Mapping?
A mapping is a set of relationships between terms
Each relationship consists of a subject term (sometimes
called subject concept), a predicate and an object term
The predicate gives the relationship ‘meaning’
Predicates may be simple to underpin something like a
thesaurus (e.g. SKOS)
exactMatch - synonyms
narrowMatch – subject concept totally embraces the object
concept
broadMatch – subject concept is totally embraced by the
object concept
majorMatch – subject and object have a lot in common but
some unique semantic elements
minorMatch - subject and object have something in
common but significant unique semantic elements
What is a Mapping?
Predicates may also be semantically
rich such as:
hasUnits – links a parameter to a unit of
measurement
isMember – links a person to a group
hasName – links a person to a label
Mappings between defined entities with
semantically rich predicates are what
computer scientists call an ontology
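A mapping can be held as a plain list of subject, predicate, object triples. The sketch below shows that shape and a simple lookup. The class and function names are assumptions; the example triples in the usage note are the two matches that appear in the term document shown later for the CF standard name sea_water_temperature.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """One relationship: subject term, predicate, object term."""
    subject: str
    predicate: str
    object: str


def objects_of(mappings, subject, predicate):
    """Return the object terms linked to `subject` by `predicate`."""
    return [
        m.object
        for m in mappings
        if m.subject == subject and m.predicate == predicate
    ]
```

For example, with `Mapping("SDN:P071::CFSN0335", "exactMatch", "SDN:P041::G963")` and `Mapping("SDN:P071::CFSN0335", "broadMatch", "SDN:P021::TEMP")` in the list, asking for the broadMatch objects of the P071 term yields the P021 discovery parameter.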
Vocabularies for Metadata
Many fields in SeaDataNet metadata are
linked through the document schema to
appropriate vocabularies
These cover subject areas such as:
Discovery parameters
Instruments
Platforms
Geographic locations (e.g. ports, sea areas)
Lists to be used are defined in the
metadata guidance documentation.
List references (e.g. P021) provide the
key to vocabulary access information
Vocabularies for Data
There are four vocabularies needed
for data in SeaDataNet
‘Light’ Parameter Usage Vocabulary
(P012)
‘The Full Monty’ Parameter Usage
Vocabulary (P011)
SeaDataNet flags (L201)
Units Vocabulary (P061)
Vocabularies for Data
‘Light’ Parameter Usage Vocabulary
(P012)
Terms to describe parameters (i.e. column
headings)
Kept as pure (no methods) and as simple as
possible
Definitions available
Mapped to MEDATLAS/GF3 extended terms
Should be the first port of call for
SeaDataNet data providers
Vocabularies for Data
‘Full’ Parameter Usage Vocabulary (P011)
Comprehensive (nearly 20,000 terms) but can be hard
to navigate
Microsoft Access navigation tool used inside BODC
could be made available on request
True superset of P012, so all P012 URLs have an
identical P011 equivalent
Handling data files will be easier if P011 version is
used in SeaDataNet data files
Port of call if P012 fails to deliver
Vocabularies for Data
SeaDataNet data qualifier flags (L201)
The full list of the flags discussed
previously
Units Vocabulary (P061)
Unlike MEDATLAS or the BODC internal
system, SeaDataNet policy is to label a value
with parameter and units INDEPENDENTLY
The vocabulary is a standardised
description of the units used; it does not
dictate the units
An aspiration is to develop units
interconversion based on P061 terms
Vocabulary Access
There are five ways to access the
SeaDataNet vocabularies
SeaDataNet Vocabulary Portal
Term and list URLs
HTTP-POX interface
SOAP API
BODC client interface
But I’m only going to cover the first four
as the portal should cover SeaDataNet
needs
Vocabulary Access
SeaDataNet Vocabulary Portal
User input through a web form at
http://seadatanet.maris2.nl/v_bodc_vocab/welcome.aspx
Returns a human-readable table with
key, term, abbreviation, definition and
modification date columns
Table may be exported as a
semicolon-delimited ‘CSV’ ASCII file
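Because the export uses semicolons rather than commas, a CSV reader needs to be told the delimiter. The sketch below assumes the exported file starts with a header row naming the columns; the exact column headings are not given here, so the code simply uses whatever the header row contains.

```python
import csv
import io


def read_vocab_export(text):
    """Parse a semicolon-delimited vocabulary export into a list of dicts."""
    # Assumption: the first line of the export is a header row.
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return list(reader)
```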
Vocabulary Access
Term and List URLs
User input is a URL
Returns an XML document based on
the SKOS standard
List documents include labels and
definitions for all terms in the list
Term documents include labels,
definition and mappings for the term
Vocabulary Access
URL syntax
Namespace base (http://vocab.ndg.nerc.ac.uk/)
‘list’ or ‘term’
List identifier (e.g. P021)
List version or ‘current’
Term identifier for term URL (e.g. TEMP)
Examples
List (SeaDataNet Parameter Discovery Vocabulary)
http://vocab.ndg.nerc.ac.uk/list/P021/current/
Term (CF Standard Name for sea temperature)
http://vocab.ndg.nerc.ac.uk/term/P071/current/CFSN0335
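The URL syntax above amounts to simple string assembly. A minimal sketch follows, with function names chosen for this example.

```python
BASE = "http://vocab.ndg.nerc.ac.uk/"


def list_url(list_id, version="current"):
    """URL of a whole vocabulary list, e.g. P021."""
    return f"{BASE}list/{list_id}/{version}/"


def term_url(list_id, term_id, version="current"):
    """URL of a single term within a list."""
    return f"{BASE}term/{list_id}/{version}/{term_id}"
```

`list_url("P021")` and `term_url("P071", "CFSN0335")` reproduce the two example URLs.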
Vocabulary Access
<?xml version="1.0" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <skos:Concept rdf:about="http://vocab.ndg.nerc.ac.uk/term/P071/7/CFSN0335">
    <skos:externalID>SDN:P071:7:CFSN0335</skos:externalID>
    <skos:prefLabel>sea_water_temperature</skos:prefLabel>
    <skos:altLabel />
    <skos:definition />
    <dc:date>2008-02-26T10:02:57.564+0000</dc:date>
    <skos:exactMatch rdf:resource="http://vocab.ndg.nerc.ac.uk/term/P041/4/G963" />
    <skos:broadMatch rdf:resource="http://vocab.ndg.nerc.ac.uk/term/P021/24/TEMP" />
  </skos:Concept>
</rdf:RDF>
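A term document like this one can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree and assumes the document contains a single skos:Concept element with the namespaces shown. The function name and the shape of the returned dictionary are choices made for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]


def parse_term_document(xml_text):
    """Extract the URL, preferred label and mappings from a term document."""
    root = ET.fromstring(xml_text)
    # Assumption: exactly one skos:Concept per term document.
    concept = root.find("skos:Concept", NS)
    mappings = []
    for child in concept:
        resource = child.get(RDF_RESOURCE)
        if resource is not None:
            # Tags look like "{namespace}broadMatch"; keep the local name.
            predicate = child.tag.split("}", 1)[1]
            mappings.append((predicate, resource))
    return {
        "url": concept.get(RDF_ABOUT),
        "label": concept.findtext("skos:prefLabel", default="", namespaces=NS),
        "mappings": mappings,
    }
```

Applied to the document above, the result carries the label sea_water_temperature and two mappings, one exactMatch and one broadMatch.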
Vocabulary Access
In SeaDataNet data and metadata we
use URNs, not URLs (in case the server
namespace changes)
URN syntax is
Namespace base (SDN)
List identifier (e.g. P021)
List version or null field for ‘current’
Term identifier (e.g. TEMP)
For example, the URL
http://vocab.ndg.nerc.ac.uk/term/P021/current/TEMP is
represented by the URN SDN:P021::TEMP
URN to URL conversion is simple string slicing
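The string slicing might look like the sketch below. It uses the 'term' form of the URL described in the URL syntax, treats an empty version field as 'current', and assumes term identifiers never contain a colon or a slash. The function names are illustrative.

```python
BASE = "http://vocab.ndg.nerc.ac.uk/"
TERM_PREFIX = BASE + "term/"


def urn_to_url(urn):
    """Turn SDN:<list>:<version>:<term> into the matching term URL."""
    parts = urn.split(":")
    if len(parts) != 4 or parts[0] != "SDN":
        raise ValueError("not a SeaDataNet term URN: %r" % urn)
    _, list_id, version, term_id = parts
    # A null version field in the URN stands for 'current'.
    return f"{TERM_PREFIX}{list_id}/{version or 'current'}/{term_id}"


def url_to_urn(url):
    """Turn a term URL back into its SDN URN."""
    if not url.startswith(TERM_PREFIX):
        raise ValueError("not a term URL: %r" % url)
    list_id, version, term_id = url[len(TERM_PREFIX):].split("/")
    if version == "current":
        version = ""
    return f"SDN:{list_id}:{version}:{term_id}"
```

Keeping the server name in a single constant means that only `BASE` has to change if the namespace moves, which is the reason for storing URNs in the first place.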
Vocabulary Access
HTTP-POX API
User input is a URL
Returns an XML document based on a BODC-defined
schema
Provides access to
List catalogue
List contents (keys, terms, abbreviations, definitions,
mappings)
Mappings
Plaintext searches across lists
Term verification
The API is documented at
http://www.bodc.ac.uk/products/web_services/vocab/methods.html
Vocabulary Access
SOAP API
User input is a programmatic service call from Java, Perl,
PHP, Python, etc. application
Returns an XML document based on a BODC-defined
schema
Provides access to
List catalogue
List contents (keys, terms, abbreviations, definitions,
mappings)
Mappings
Plaintext searches across lists
Term verification
The API is documented at
http://www.bodc.ac.uk/products/web_services/vocab/methods.html
The WSDL is available from http://vocab.ndg.nerc.ac.uk/
Vocabulary Maintenance
What if you can’t find the term you need?
Initially contact the SeaDataNet help desk ([email protected])
If they cannot resolve your problem they will pass the
problem on to me
I will endeavour to add new terms or identify
appropriate existing terms
Adding terms may involve discussions with
vocabulary governance authorities
This can take time (possibly 2-3 weeks) so please try
to think ahead