SeaDataNet Training Course, Ostend, June 16-19, 2008
Data Formats, Flags
and Vocabularies
Roy Lowry
British Oceanographic Data Centre
File Formats
Available formats
Format Selection Criteria
 Types of Data
 Delivery Use Case Issues
SeaDataNet Profiling
Objectives
SeaDataNet Profiling Details
Available Formats
 Three format profiles are being
developed for SeaDataNet data transfers
 SeaDataNet ODV Profile
 Simple ASCII format based on a
spreadsheet model
 SeaDataNet MEDATLAS Profile
 Minor variation on an established ASCII
format
 SeaDataNet CF NetCDF Profile
 Binary data conforming to API and content
model based on an established community
standard (CF)
Format Selection Criteria
The $64,000 question is “What
format should I use for my
data?”
The answer depends on the
type of data and on the data
delivery use case
Types of Data
 Think of data in terms of ‘feature types’
 Profiles (x, y, t effectively fixed: z varies)
 Bottle casts, CTDs, XBTs, radiosondes, core profiles
 Point series (x, y, z effectively fixed: t varies)
 Current meters, wave statistics, sea level, wind
velocity
 Trajectories (x, y, z (sometimes), t all vary)
 Underway data (TSG, bathymetry, meteorology),
undulator data, airborne measurements
 Grids (Two or more of x, y, z, t vary systematically)
 Satellite data, model output, synthesised data
products
Types of Data
 Most of our data may be modelled in
terms of these feature types
 For example:
 CTD data
– Modelled well by the ‘profile’ type
 Recording current meter data
– Modelled well by the ‘point series’ type
 Moored ADCP
– Modelled poorly by ‘point series’ type
(needs to be considered as one point
series per depth bin)
– But is modelled well by ‘grid’ with z, t
varying and x, y fixed
Delivery Use Case Issues
Data exchange between consenting
Mediterranean partners
 Data provider holds data in
MEDATLAS format
 Data recipient wants data in
MEDATLAS format
Could be addressed using Nemo
software to convert MEDATLAS to
ODV profile
Delivery Use Case Issues
 Problems with this approach
 Recipient needs to do unnecessary work converting
ODV to MEDATLAS
 Risk of information loss in the conversion process
 MEDATLAS is used by a significant proportion
of the SeaDataNet community
 Consequently, the transaction system
development overhead to support exchange in
MEDATLAS format was considered worthwhile
Format Recommendations
Mandatory formats
 Use ODV for
 Profiles
 Point series
 Trajectories (including underway
ADCP)
 Use NetCDF for
 Grids
 Data that don’t fit comfortably into
ODV due to shape or volume
 Data for use with NetCDF-enabled
tools
Format Recommendations
Optional format
 Use MEDATLAS for
Whatever you use MEDATLAS
for at the moment
SeaDataNet Profiling Objectives
 Two objectives
 Providing linkage between data and
SeaDataNet metadata (CDI record)
 Standardising semantics
 Consistent labelling of parameters
– Use terms from a controlled vocabulary
(more on this later)
 Consistent labelling of storage units
– Use terms from a controlled vocabulary
– Parameter definition DOES NOT dictate
storage unit
SeaDataNet ODV Profile
 Described in BSCW document (Word)
 https://www.ifremer.fr/bscw/bscw.cgi/d93460/Specification%20of%20SeaDataNet%20Data%20Transport%20Formats
 Examples of profile, point series and
trajectory data (Excel)
 https://www.ifremer.fr/bscw/bscw.cgi/d93465/Examples%20of%20SeaDataNet%20variant%20ODV%20spreadsheet-based%20import%20format
SeaDataNet ODV Profile
 ODV format based on a spreadsheet
model with three types of row
 Comment row
 One cell with text starting with //
 Column header row
 Data row
 Column header and data rows have
three types of column
 Metadata columns
 Primary variable data columns (value + flag)
 Data columns (value + flag pairs)
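A minimal sketch of how the three row types might be told apart when reading a file. It assumes tab-separated fields and that the first non-comment row is the column header row. The function name, column names and sample values are invented for illustration and are not part of the SeaDataNet specification.

```python
def read_odv_rows(text):
    """Split ODV-style spreadsheet text into comment, header and data rows."""
    comments, header, data = [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue                       # ignore blank lines
        if line.startswith("//"):
            comments.append(line)          # comment row: text starting with //
        elif header is None:
            header = line.split("\t")      # first non-comment row: column headers
        else:
            data.append(line.split("\t"))  # every later row is a data row
    return comments, header, data


# Hypothetical sample file content (names and values are illustrative only).
sample = (
    "//example comment\n"
    "Station\tDepth [m]\tQV\n"
    "S1\t5.0\t1\n"
    "S1\t10.0\t1\n"
)
comments, header, data = read_odv_rows(sample)
print(header)
print(len(data), "data rows")
```

Returning the comment rows separately keeps them available for later inspection rather than discarding them.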
SeaDataNet ODV Profile
 SeaDataNet profile extensions
 CDI linkage
 Addition of two extra metadata columns
(LOCAL_CDI_ID and EDMO_code)
 Semantic mapping
 Structured comment records immediately
preceding the ODV column header record
 First record is ‘//SDN_parameter_mapping’
 Followed by one mapping record for each
data column in the file
SeaDataNet ODV Profile
 Mapping record example
 //<subject>SDN:LOCAL:Depth</subject><object>SD
N:P011::ADEPZZ01</object><units>SDN:P061::ULA
A</units>
– Subject element is the column heading text
excluding ODV units field (e.g. ‘Depth’ for ‘Depth
[m]’)
– Object element is the SeaDataNet URN for the
parameter (SDN:P011::ADEPZZ01)
– Units element is the SeaDataNet URN for the
data storage units (SDN:P061::ULAA)
 More about URNs and what we can do with them
later…..
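A mapping record of this shape can be picked apart with a regular expression. The sketch below is one possible approach, not an official SeaDataNet parser. The function name is hypothetical, and it assumes the three elements always appear in the order subject, object, units on a single line.

```python
import re

# Pattern for one ODV semantic mapping record (assumed element order).
MAPPING_RE = re.compile(
    r"^//\s*<subject>(?P<subject>.*?)</subject>\s*"
    r"<object>(?P<object>.*?)</object>\s*"
    r"<units>(?P<units>.*?)</units>\s*$"
)


def parse_mapping_record(record):
    """Return the subject, object and units URNs from a mapping record."""
    match = MAPPING_RE.match(record.strip())
    if match is None:
        raise ValueError("not a mapping record: %r" % record)
    return match.groupdict()


record = (
    "//<subject>SDN:LOCAL:Depth</subject>"
    "<object>SDN:P011::ADEPZZ01</object>"
    "<units>SDN:P061::ULAA</units>"
)
mapping = parse_mapping_record(record)
# The last field of the subject URN is the column heading text.
print(mapping["subject"].split(":")[-1], "->", mapping["object"])
```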
SeaDataNet ODV Profile
 SeaDataNet Metadata and Primary Variables
 Profile data
 Metadata (x,y,t) set to nominal profile position and
time (same for every data value)
 Primary variable is the z co-ordinate (depth in metres
or pressure in decibars)
 Point series data
 Metadata (x,y,t) set to the measurement location and
series start time (same for every data value)
 Primary variable is the t co-ordinate (Chronological
Julian Day - days elapsed since 00:00 on January 1
4713 BC)
 Trajectory data
 Metadata (x,y,t) set to measurement time and position
 Primary variable is the z co-ordinate (depth in metres
or pressure in decibars)
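For point series, the Chronological Julian Day can be derived from a calendar date and time. The sketch below assumes times are in UTC and relies on Python's proleptic Gregorian day ordinal. The helper name and the constant name are illustrative.

```python
from datetime import datetime

# Offset between Python's day ordinal (day 1 = 0001-01-01) and the
# Chronological Julian Day number of the same calendar date.
CJD_OFFSET = 1721425


def to_cjd(moment):
    """Convert a (UTC) datetime to Chronological Julian Day with day fraction."""
    seconds = moment.hour * 3600 + moment.minute * 60 + moment.second
    return moment.toordinal() + CJD_OFFSET + seconds / 86400.0


# Midnight on 1 January 2000 is the start of Chronological Julian Day 2451545.
print(to_cjd(datetime(2000, 1, 1, 0, 0, 0)))
print(to_cjd(datetime(2008, 6, 16, 12, 0, 0)))
```

Because the chronological day starts at midnight rather than noon, the value is half a day larger than the astronomical Julian Date for the same instant.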
SeaDataNet ODV Profile
 Watchpoints
 File extension should be .txt
 Field separator is the tab character (not
semi-colon)
 Physical file mapping
 The format is capable of holding multiple SeaDataNet
data objects in a single physical file
 The SeaDataNet 1 system CANNOT support this
 Means aggregation and splitting tools (or a lot of
patience!) will be required (hardly rocket science)
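A splitting tool could be as simple as grouping data rows by the two SeaDataNet metadata columns. The sketch below is one hypothetical approach. It assumes every data row carries its own LOCAL_CDI_ID and EDMO_code values, and the sample column names and values are invented.

```python
def split_by_cdi(header, rows):
    """Group data rows by their (EDMO_code, LOCAL_CDI_ID) pair."""
    edmo_col = header.index("EDMO_code")
    cdi_col = header.index("LOCAL_CDI_ID")
    groups = {}
    for row in rows:
        key = (row[edmo_col], row[cdi_col])
        groups.setdefault(key, []).append(row)
    return groups


def to_odv_text(header, rows):
    """Serialise a header and data rows as tab-separated text."""
    lines = ["\t".join(header)] + ["\t".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical file content holding two data objects.
header = ["Cruise", "LOCAL_CDI_ID", "EDMO_code", "Depth [m]"]
rows = [
    ["C1", "A", "43", "5"],
    ["C1", "A", "43", "10"],
    ["C1", "B", "43", "5"],
]
groups = split_by_cdi(header, rows)
for (edmo, cdi), members in groups.items():
    print(edmo, cdi, len(members))
```

A real tool would also need to copy the comment rows, including the semantic mapping records, into each output file.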
SeaDataNet MEDATLAS Profile
 Those who want to use MEDATLAS
know it better than me, so I’m not going
to try and teach the format!
 The most important SeaDataNet
extension is the link to CDI records,
which is done by a pair of structured
comment records for each SeaDataNet
object thus:
 *EDMO_CODE = EDMO identifier of the data
centre managing the CDI
 *LOCAL_CDI_ID = local identifier of the station
SeaDataNet MEDATLAS Profile
We can also add standardised
semantic mapping records as per
ODV such as:
 *<subject>SDN:LOCAL:Temperature</subject><object>SDN:P0
11::TEMPS901</object><units>SDN:P061::UPAA</units>
However, once the mapping
between MEDATLAS parameter
codes and P011 is completed,
these become unnecessary
SeaDataNet CF NetCDF Profile
 This is VERY immature, so currently
there is nothing to teach
 ASCII formats should be sufficient for
most SeaDataNet 1 transactions
 Further work during the next 6 months
 Partners who feel they need NetCDF for
their data should contact the Technical
Task Team (Dick Schaap or Roy Lowry)
SeaDataNet Qualifying Flags
What is a Qualifying Flag?
SeaDataNet Flags
Conflict resolution
What is a Qualifying Flag?
 Back in the mists of time (IODE in early 1980s?) it was
decreed that all data values should be accompanied by a
‘flag’ in the form of a 1-byte code
 Built into many data format specifications (MEDATLAS,
BODC PXF/QXF, GF3…)
 Initially thought of as a data quality label
 However, it provides the only metadata ‘hook’ that is
unambiguously linked to a specific data value
 Consequently, it has suffered information overload
carrying other information about non-quality issues
 We cannot correct this without major re-engineering of
data held as files, which isn’t going to happen
SeaDataNet Flags
 Information overloading has led to two
types of flag in SeaDataNet
 Quality Flags
 0 – quality unknown
 1 – good value (looks good and no reported
problems)
 2 – probably good value (associated with a
known malfunction but looks OK)
 3 – probably bad value (associated with a
known malfunction but looks wrong)
 4 – bad value (clearly wrong)
SeaDataNet Flags
 Information overloading has led to two
types of flag in SeaDataNet
 Information flags
 5 – changed value (during quality control)
 6 – below detection (true value <quoted
value)
 7 – value in excess (true value >quoted
value)
 8 – interpolated value (special case of a
changed value)
 9 – missing value
 A – phenomenon uncertain (e.g. question
over identification of biological specimen)
Conflict Resolution
We can now see the problems
caused by overloading
 How can we tell the difference
between a ‘good changed value’ and a
‘bad changed value’?
 Simple answer is that we can’t. We
can indicate that the value was changed
(flag 5), good (flag 1) or bad (flag 4),
but only one of these per value
 So we have to compromise…..
Conflict Resolution
 How do we compromise?
 By prioritising flag assignments
 Initially, all flags are set to 0, 9, 7, 6 or A
(detection level and uncertainty information
comes from the originator, not QA)
 Next we either interpolate or replace and flag
appropriately (8 or 5)
 Finally we switch remaining zero flags to 1,
2, 3 or 4 as appropriate
 This is not ideal and we need to do
better in SeaDataNet 2.
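The prioritisation can be sketched as a small function. This is an illustration of the three steps above, not BODC's actual procedure. The function and argument names are hypothetical, and it assumes that an interpolation or replacement flag overrides the initial flag.

```python
ORIGINATOR_FLAGS = {"9", "7", "6", "A"}  # missing, in excess, below detection, uncertain
EDIT_FLAGS = {"8", "5"}                  # interpolated, changed
QUALITY_FLAGS = {"1", "2", "3", "4"}     # good, probably good, probably bad, bad


def assign_flag(originator_flag=None, edit_flag=None, quality_flag=None):
    """Pick the single flag stored with a value, following the priority order."""
    flag = "0"                                   # step 1: default is 'quality unknown'
    if originator_flag in ORIGINATOR_FLAGS:
        flag = originator_flag                   # step 1: information from the originator
    if edit_flag in EDIT_FLAGS:
        flag = edit_flag                         # step 2: value interpolated or replaced
    if flag == "0" and quality_flag in QUALITY_FLAGS:
        flag = quality_flag                      # step 3: only remaining zeros get 1-4
    return flag


print(assign_flag(quality_flag="1"))
print(assign_flag(originator_flag="6", quality_flag="4"))
print(assign_flag(edit_flag="5", quality_flag="4"))
```

The last two calls show the compromise: once a value carries an information flag, its quality assessment is not stored.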
Vocabularies
What are vocabularies and
mappings?
Vocabularies for Metadata
Vocabularies for Data
Vocabulary Access
Vocabulary Maintenance
What is a Vocabulary?
 A vocabulary is a list of standardised
terms used to populate a metadata field
 The SeaDataNet vocabulary model
considers each such term to possess
 A key (permanent, semantically neutral, possibly
mnemonic, identifier for the term)
 A term (full human-readable label)
 An abbreviation (short human-readable
label)
 A definition (full explanation of the term’s
meaning)
What is a Mapping?
 A mapping is a set of relationships between terms
 Each relationship consists of a subject term (sometimes
called subject concept), a predicate and an object term
 The predicate gives the relationship ‘meaning’
 Predicates may be simple to underpin something like a
thesaurus (e.g. SKOS)
 exactMatch - synonyms
 narrowMatch – subject concept totally embraces the object
concept
 broadMatch – subject concept is totally embraced by the
object concept
 majorMatch – subject and object have a lot in common but
some unique semantic elements
 minorMatch - subject and object have something in
common but significant unique semantic elements
What is a Mapping?
 Predicates may also be semantically
rich such as:
 hasUnits – links a parameter to a unit of
measurement
 isMember – links a person to a group
 hasName – links a person to a label
 Mappings between defined entities with
semantically rich predicates are what
computer scientists call an ontology
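One way to picture a mapping is as a list of subject, predicate, object triples. The sketch below uses two relationships taken from the CF Standard Name example document shown elsewhere in this talk. The tuple type and helper function are illustrative and not part of any SeaDataNet API.

```python
from collections import namedtuple

# One relationship in a mapping: subject term, predicate, object term.
Relationship = namedtuple("Relationship", ["subject", "predicate", "object"])

mapping = [
    Relationship("SDN:P071::CFSN0335", "exactMatch", "SDN:P041::G963"),
    Relationship("SDN:P071::CFSN0335", "broadMatch", "SDN:P021::TEMP"),
]


def objects_for(relationships, subject, predicate):
    """Return every object term linked to a subject by the given predicate."""
    return [
        rel.object
        for rel in relationships
        if rel.subject == subject and rel.predicate == predicate
    ]


print(objects_for(mapping, "SDN:P071::CFSN0335", "broadMatch"))
```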
Vocabularies for Metadata
 Many fields in SeaDataNet metadata are
linked through the document schema to
appropriate vocabularies
 These cover subject areas such as:
 Discovery parameters
 Instruments
 Platforms
 Geographic locations (e.g. ports, sea areas)
 Lists to be used are defined in the
metadata guidance documentation.
 List references (e.g. P021) provide the
key to vocabulary access information
Vocabularies for Data
There are four vocabularies needed
for data in SeaDataNet
 ‘Light’ Parameter Usage Vocabulary
(P012)
 ‘The Full Monty’ Parameter Usage
Vocabulary (P011)
 SeaDataNet flags (L201)
 Units Vocabulary (P061)
Vocabularies for Data
 ‘Light’ Parameter Usage Vocabulary
(P012)
 Terms to describe parameters (i.e. column
headings)
 Kept as pure (no methods) and as simple as
possible
 Definitions available
 Mapped to MEDATLAS/GF3 extended terms
 Should be the first port of call for
SeaDataNet data providers
Vocabularies for Data
 ‘Full’ Parameter Usage Vocabulary (P011)
 Comprehensive (nearly 20,000 terms) but can be hard
to navigate
 Microsoft Access navigation tool used inside BODC
could be made available on request
 True superset of P012, so all P012 URLs have an
identical P011 equivalent
 Handling data files will be easier if P011 version is
used in SeaDataNet data files
 Port of call if P012 fails to deliver
Vocabularies for Data
 SeaDataNet data qualifier flags (L201)
 The full list of the flags discussed
previously
 Units Vocabulary (P061)
 Unlike MEDATLAS or the BODC internal
system, SeaDataNet policy is to label a value
with parameter and units INDEPENDENTLY
 The vocabulary is a standardised
description of the units used, it does not
dictate the units
 An aspiration is to develop units
interconversion based on P061 terms
Vocabulary Access
 There are five ways to access the
SeaDataNet vocabularies
 SeaDataNet Vocabulary Portal
 Term and list URLs
 HTTP-POX interface
 SOAP API
 BODC client interface
 But I’m only going to cover the first four
as the portal should cover SeaDataNet
needs
Vocabulary Access
SeaDataNet Vocabulary Portal
 User input through a web form at
http://seadatanet.maris2.nl/v_bodc_vocab/welcome.aspx
 Returns a human-readable table with
key, term, abbreviation, definition and
modification date columns
 Table may be exported as a
semicolon-delimited ‘CSV’ ASCII file
Vocabulary Access
Term and List URLs
 User input is a URL
 Returns an XML document based on
the SKOS standard
 List documents include labels and
definitions for all terms in the list
 Term documents include labels,
definition and mappings for the term
Vocabulary Access
 URL syntax
 Namespace base (http://vocab.ndg.nerc.ac.uk/)
 ‘list’ or ‘term’
 List identifier (e.g. P021)
 List version or ‘current’
 Term identifier for term URL (e.g. TEMP)
 Examples
 List (SeaDataNet Parameter Discovery Vocabulary)
 http://vocab.ndg.nerc.ac.uk/list/P021/current/
 Term (CF Standard Name for sea temperature)
 http://vocab.ndg.nerc.ac.uk/term/P071/current/CFSN0335
Vocabulary Access
<?xml version="1.0" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <skos:Concept rdf:about="http://vocab.ndg.nerc.ac.uk/term/P071/7/CFSN0335">
    <skos:externalID>SDN:P071:7:CFSN0335</skos:externalID>
    <skos:prefLabel>sea_water_temperature</skos:prefLabel>
    <skos:altLabel />
    <skos:definition />
    <dc:date>2008-02-26T10:02:57.564+0000</dc:date>
    <skos:exactMatch rdf:resource="http://vocab.ndg.nerc.ac.uk/term/P041/4/G963" />
    <skos:broadMatch rdf:resource="http://vocab.ndg.nerc.ac.uk/term/P021/24/TEMP" />
  </skos:Concept>
</rdf:RDF>
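A term document like this one can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a trimmed copy of the document, embedded as a string so that no network access is needed. The function name is illustrative.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

# Trimmed copy of the term document (label and mappings only).
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://vocab.ndg.nerc.ac.uk/term/P071/7/CFSN0335">
    <skos:prefLabel>sea_water_temperature</skos:prefLabel>
    <skos:exactMatch rdf:resource="http://vocab.ndg.nerc.ac.uk/term/P041/4/G963" />
    <skos:broadMatch rdf:resource="http://vocab.ndg.nerc.ac.uk/term/P021/24/TEMP" />
  </skos:Concept>
</rdf:RDF>"""


def summarise_concept(xml_text):
    """Extract the preferred label and mapped term URLs from a term document."""
    root = ET.fromstring(xml_text)
    concept = root.find("skos:Concept", NS)
    resource = "{%s}resource" % NS["rdf"]  # namespaced attribute name
    return {
        "label": concept.findtext("skos:prefLabel", namespaces=NS),
        "exactMatch": [e.get(resource) for e in concept.findall("skos:exactMatch", NS)],
        "broadMatch": [e.get(resource) for e in concept.findall("skos:broadMatch", NS)],
    }


summary = summarise_concept(DOC)
print(summary["label"])
print(summary["broadMatch"])
```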
Vocabulary Access
 In SeaDataNet data and metadata we
use URNs, not URLs (in case the server
namespace changes)
 URN syntax is
 Namespace base (SDN)
 List identifier (e.g. P021)
 List version or null field for ‘current’
 Term identifier (e.g. TEMP)
 For example the URL
http://vocab.ndg.nerc.ac.uk/term/P021/current/TEMP is
represented by the URN SDN:P021::TEMP
 URN to URL conversion is simple string slicing
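A minimal sketch of the conversion in both directions. It assumes that term URLs use the ‘term’ path element from the URL syntax and that a null version field stands for ‘current’. The function names are illustrative.

```python
BASE = "http://vocab.ndg.nerc.ac.uk/"


def urn_to_url(urn):
    """Turn a term URN such as SDN:P021::TEMP into its term URL."""
    namespace, list_id, version, term_id = urn.split(":")
    if namespace != "SDN":
        raise ValueError("unexpected namespace: %r" % namespace)
    # A null version field in the URN means the current list version.
    return "%sterm/%s/%s/%s" % (BASE, list_id, version or "current", term_id)


def url_to_urn(url):
    """Turn a term URL back into the corresponding URN."""
    if not url.startswith(BASE):
        raise ValueError("unexpected namespace base: %r" % url)
    kind, list_id, version, term_id = url[len(BASE):].strip("/").split("/")
    if kind != "term":
        raise ValueError("not a term URL: %r" % url)
    if version == "current":
        version = ""
    return "SDN:%s:%s:%s" % (list_id, version, term_id)


print(urn_to_url("SDN:P021::TEMP"))
print(url_to_urn("http://vocab.ndg.nerc.ac.uk/term/P071/7/CFSN0335"))
```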
Vocabulary Access
HTTP-POX API
 User input is a URL
 Returns an XML document based on a BODC-defined
schema
 Provides access to
 List catalogue
 List contents (keys, terms, abbreviations, definitions,
mappings)
 Mappings
 Plaintext searches across lists
 Term verification
 The API is documented at
http://www.bodc.ac.uk/products/web_services/vocab/methods.html
Vocabulary Access
 SOAP API
 User input is a programmatic service call from Java, Perl,
PHP, Python, etc. application
 Returns an XML document based on a BODC-defined
schema
 Provides access to
 List catalogue
 List contents (keys, terms, abbreviations, definitions,
mappings)
 Mappings
 Plaintext searches across lists
 Term verification
 The API is documented at
http://www.bodc.ac.uk/products/web_services/vocab/methods.html
 The WSDL is available from http://vocab.ndg.nerc.ac.uk/
Vocabulary Maintenance
 What if you can’t find the term you need?
 Initially contact the SeaDataNet help desk ([email protected])
 If they cannot resolve your problem they will pass the
problem on to me
 I will endeavour to add new terms or identify
appropriate existing terms
 Adding terms may involve discussions with
vocabulary governance authorities
 This can take time (possibly 2-3 weeks) so please try
to think ahead