Metadata Common Vocabulary a journey from a glossary to an ontology of statistical metadata, and back Sérgio Bacelar ([email protected]) Statistics Portugal Joint UNECE/Eurostat/OECD Work Session.


Metadata Common Vocabulary
a journey from a glossary to an ontology of
statistical metadata, and back
Sérgio Bacelar ([email protected])
Statistics Portugal
Joint UNECE/Eurostat/OECD Work Session on
Statistical Metadata (METIS)
Lisbon, 11 – 13 March, 2009
Definitions
SDMX and SDMX Content-Oriented Guidelines (COG)
Metadata Common Vocabulary (MCV)
Concepts and related definitions used in structural and
reference metadata of international organizations and
national data producing agencies.
Content-Oriented Guidelines = MCV + Cross-Domain
Concepts (subset of MCV) + Statistical Subject-Matter
Domains
Latest version (2009): 397 terms.
Goal: uniform understanding of standard metadata
concepts.
2
ESSnet on SDMX
• Objective
– Further development of SDMX
• Further development and improvement of the SDMX
Content-oriented Guidelines
• Metadata Task Force on SDMX (Statistics Portugal)
• WP Proposal: MCV Ontology
• Metadata Common Vocabulary (MCV)
• Semantic univocity → design of a conceptual model of the
domain
• Detecting possible inconsistencies, redundancies or
incompleteness in the glossary
• Lack of structure: flat list, no hierarchical relations between
terms
• No semantic relations between terms
3
Conceptual system
Building a glossary usually presupposes the prior
design of a conceptual model of the respective
domain.
• Proposal for a revision of MCV
– Starting with the existing terms and definitions
– Creating semantic relations between terms based on
the definitions of the MCV terms
• (bottom-up or middle-out strategy):
– Goal: reveal the latent conceptual system, detecting possible
structural incongruities or redundancies.
4
Conceptual system and Concept Map
• Main goals
– Find redundancies, inconsistencies, omissions, and terms
belonging to domains other than statistical
metadata (explained by the complex and
interdisciplinary nature of metadata).
– To find omitted terms (important and relevant ones), it is
necessary to analyze the definitions of the concepts.
• With this in mind, we built a “Concept Map” representing
about 20% of the terms in MCV (draft version).
• A concept map is a diagram showing the relationships
among terms/concepts. Concepts are connected with labeled
arrows, in a downward-branching hierarchical structure.
• Graphical visualization is difficult because of the large
number of terms and relations.
5
[Figures: Concept Map, partial views (two slides)]
7
Terms and relations between MCV terms/concepts

Concept_1                       relation           Concept_2
Accessibility                   characteristic_of  Quality
Accounting basis                type_of            Methods / procedures / conventions
Accounting conventions          same_as            Accounting basis
Accuracy                        characteristic_of  Quality
Adjustment                      type_of            Compilation practices
Adjustment methods              same_as            Adjustment
Administrative data             has_a              Administrative source
Administrative data             type_of            Data
Administrative data collection  collection_of      Administrative data
Administrative item             part_of            Administrative record
Administrative record           part_of            Administrative data
Age                             attributeOf        Person
Agency or organisation          typeOf             Analytical unit
Agency or organisation          has                Comment
Aggregation                     group_of           Category
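The terms and relations listed above can be read as (Concept_1, relation, Concept_2) triples, which makes simple consistency checks possible. The following is a minimal sketch, not the method used in the project: the function names are illustrative, and the two checks (relation names spelled in more than one style, and concepts that only ever appear as Concept_2) are just assumed examples of the kind of inconsistency or omission one might look for.

```python
import re
from collections import defaultdict

# (Concept_1, relation, Concept_2) rows taken from the slide.
TRIPLES = [
    ("Accessibility", "characteristic_of", "Quality"),
    ("Accounting basis", "type_of", "Methods / procedures / conventions"),
    ("Accounting conventions", "same_as", "Accounting basis"),
    ("Accuracy", "characteristic_of", "Quality"),
    ("Adjustment", "type_of", "Compilation practices"),
    ("Adjustment methods", "same_as", "Adjustment"),
    ("Administrative data", "has_a", "Administrative source"),
    ("Administrative data", "type_of", "Data"),
    ("Administrative data collection", "collection_of", "Administrative data"),
    ("Administrative item", "part_of", "Administrative record"),
    ("Administrative record", "part_of", "Administrative data"),
    ("Age", "attributeOf", "Person"),
    ("Agency or organisation", "typeOf", "Analytical unit"),
    ("Agency or organisation", "has", "Comment"),
    ("Aggregation", "group_of", "Category"),
]


def normalize(relation: str) -> str:
    """Map camelCase and snake_case spellings to one snake_case key."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", relation).lower()


def spelling_variants(triples):
    """Return relations that are written in more than one way."""
    seen = defaultdict(set)
    for _, relation, _ in triples:
        seen[normalize(relation)].add(relation)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


def target_only_concepts(triples):
    """Return concepts used as Concept_2 but never described as Concept_1."""
    subjects = {s for s, _, _ in triples}
    return sorted({o for _, _, o in triples} - subjects)


print(spelling_variants(TRIPLES))
print(target_only_concepts(TRIPLES))
```

On this excerpt the first check flags the two spellings of the "type of" relation, and the second lists terms such as Quality or Data, which are referenced but not themselves described in the excerpt.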
8
Using Resource Description
Framework (RDF)
RDF is a framework for representing information on the Web.
RDF is particularly concerned with meaning.
An RDF graph is a collection of triples, each consisting of a subject, a predicate and
an object, e.g. “MetadataExchange is-a DataAndMetadataExchange”
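As an illustration of the subject, predicate, object structure, the example triple can be written out in N-Triples syntax. This is only a sketch under stated assumptions: the namespace `http://example.org/mcv#` is hypothetical, and rendering "is-a" as `rdfs:subClassOf` is one possible reading (it could equally be modeled as `rdf:type`).

```python
from typing import NamedTuple

MCV = "http://example.org/mcv#"  # hypothetical namespace, not from the slides
# Assumption: "is-a" is interpreted here as rdfs:subClassOf.
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self) -> str:
        """Serialize the triple as one N-Triples line (IRIs only)."""
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


example = Triple(
    MCV + "MetadataExchange",
    RDFS_SUBCLASS_OF,
    MCV + "DataAndMetadataExchange",
)
print(example.to_ntriples())
```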
9
Middle range solution
Using SKOS (Simple Knowledge Organization
System)
- currently being developed within the W3C framework
A bridging technology between “chaos” and the more
rigorous logical formalism of ontology languages
(like OWL).
It is an application of the Resource Description
Framework (RDF) providing a model for
expressing the basic structure and content of
concept schemes such as thesauri.
10
SKOS example: the concept “data”
<rdf:RDF
...........
<skos:Concept rdf:about="http://www.my.com/#data">
<skos:definition>Characteristics or information, usually numerical, that are
collected through observation</skos:definition>
<skos:prefLabel>data</skos:prefLabel>
<skos:altLabel></skos:altLabel>
<skos:broader rdf:resource="http://www.my.com/#information"/>
<skos:related rdf:resource="http://www.my.com/#Characteristic"/>
<skos:scopeNote>Data is the physical representation of information
in a manner suitable for communication, interpretation, or
processing by human beings or by automatic means (Economic
Commission for Europe of the United Nations (UNECE),
"Terminology on Statistical
Metadata", Conference of European
Statisticians Statistical Standards and Studies, No. 53, Geneva,
2000).</skos:scopeNote>
</skos:Concept>
</rdf:RDF>
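A SKOS file of this kind can be read with any XML or RDF tool. The sketch below uses only Python's standard `xml.etree.ElementTree` module and a shortened copy of the concept above; the namespace declarations are the standard RDF and SKOS ones (they are elided on the slide), and the function name `read_concepts` is illustrative.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Shortened version of the slide's example, with namespaces declared.
SKOS_XML = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="http://www.my.com/#data">
    <skos:prefLabel>data</skos:prefLabel>
    <skos:broader rdf:resource="http://www.my.com/#information"/>
    <skos:related rdf:resource="http://www.my.com/#Characteristic"/>
  </skos:Concept>
</rdf:RDF>"""


def read_concepts(xml_text):
    """Return {concept URI: label and linked concept URIs}."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for node in root.findall(f"{{{SKOS}}}Concept"):
        uri = node.get(f"{{{RDF}}}about")
        concepts[uri] = {
            "prefLabel": node.findtext(f"{{{SKOS}}}prefLabel"),
            "broader": [b.get(f"{{{RDF}}}resource")
                        for b in node.findall(f"{{{SKOS}}}broader")],
            "related": [r.get(f"{{{RDF}}}resource")
                        for r in node.findall(f"{{{SKOS}}}related")],
        }
    return concepts


concepts = read_concepts(SKOS_XML)
print(concepts["http://www.my.com/#data"])
```

The `broader` and `related` links are what a flat glossary lacks: they turn the list of terms into a navigable structure.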
11
Ontologies
Ontology = an explicit formal specification of
the terms in the domain (statistical
metadata) and the relations among them. It is
a model of reality in the world (created
using an iterative design).
Using an ontology editing and modeling system
such as Protégé (open-source
software at http://protege.stanford.edu)
12
Ontology reasoning
It is essential to provide tools and services
(reasoners) that help users answer queries over
ontologies, classes and instances, e.g.:
find more general/specific classes;
retrieve individuals matching a given query,
e.g. is there any survey with quarterly
frequency that uses any classification system
and has an on-line database as a dissemination
format?
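The two kinds of query mentioned above can be imitated in a few lines of plain Python. This is a rough sketch and not a description-logic reasoner: apart from the ComputerAssistedInterviewing/DataCollection pair, all class names, survey names and attribute values are hypothetical, and the "reasoning" is just a walk up the subclass links plus a filter.

```python
# Subclass links (child -> parent). The second link is hypothetical.
SUBCLASS_OF = {
    "ComputerAssistedInterviewing": "DataCollection",
    "DataCollection": "StatisticalMethodology",
}


def superclasses(cls):
    """More general classes, nearest first."""
    result = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.append(cls)
    return result


def subclasses(cls):
    """More specific classes, direct or indirect."""
    return sorted(c for c in SUBCLASS_OF if cls in superclasses(c))


# Hypothetical individuals of a class "Survey".
SURVEYS = [
    {"name": "SurveyA", "frequency": "quarterly",
     "classifications": ["ClassificationX"],
     "dissemination_formats": ["on-line database", "publication"]},
    {"name": "SurveyB", "frequency": "annual",
     "classifications": ["ClassificationX"],
     "dissemination_formats": ["on-line database"]},
    {"name": "SurveyC", "frequency": "quarterly",
     "classifications": [],
     "dissemination_formats": ["on-line database"]},
]


def matching_surveys(surveys):
    """Quarterly surveys that use a classification and publish on-line."""
    return [
        s["name"] for s in surveys
        if s["frequency"] == "quarterly"
        and s["classifications"]
        and "on-line database" in s["dissemination_formats"]
    ]


print(superclasses("ComputerAssistedInterviewing"))
print(matching_surveys(SURVEYS))
```

A real reasoner would also infer facts that were never stated explicitly, which this filter does not attempt.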
13
Ontologies - methodology
Developing an ontology:
1. Defining classes
2. Arranging classes in a taxonomic hierarchy
(classes and subclasses)
3. Defining slots (same as roles or properties)
4. Describing allowed values for these slots
(facets, role restrictions)
5. Filling in the values for slots for instances
(individuals)
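The five steps can be mirrored in a small data structure. The sketch below is only an assumed illustration in plain Python (an ontology editor such as Protégé handles all of this internally): the slot names, allowed values and the individual "SurveyA" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyClass:
    name: str                                   # step 1: a class
    parent: Optional["OntologyClass"] = None    # step 2: taxonomic hierarchy
    slots: dict = field(default_factory=dict)   # steps 3-4: slot -> allowed values

    def all_slots(self):
        """Own slots plus those inherited from superclasses."""
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.slots}


@dataclass
class Individual:
    name: str
    cls: OntologyClass
    values: dict                                # step 5: slot values

    def __post_init__(self):
        allowed = self.cls.all_slots()
        for slot, value in self.values.items():
            if slot not in allowed:
                raise ValueError(f"{self.cls.name} has no slot {slot!r}")
            if value not in allowed[slot]:
                raise ValueError(f"{value!r} is not allowed for slot {slot!r}")


data_collection = OntologyClass(
    "DataCollection",
    slots={"frequency": {"monthly", "quarterly", "annual"}},
)
cai = OntologyClass(
    "ComputerAssistedInterviewing",
    parent=data_collection,
    slots={"mode": {"CAPI", "CATI"}},
)
survey = Individual("SurveyA", cai, {"frequency": "quarterly", "mode": "CATI"})
print(sorted(cai.all_slots()))
```

Creating an individual with a value outside the allowed set (for example a frequency of "weekly") raises a `ValueError`, which is the role that facets play in step 4.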
14
Ontology - Classes
A first attempt at building an ontology of statistical metadata:
main classes created from MCV
(According to SDMX Content-Oriented Guidelines:
Framework, Draft March 2006, p.6)
1. General metadata (derived from ISO, UNECE and
UN documents);
2. Metadata describing Statistical methodologies;
3. Metadata describing Quality assessment;
4. Terms referring to Data and metadata exchange
(SDMX information model and data structure definitions,
etc.).
15
[Figures: Classes and subclasses in Protégé (three slides; the last one is labeled “Quality”)]
18
Properties
(e.g. “Quality, according to Eurostat, has a dimension called relevance”)
[Figure: diagram with the labels “Class”, “relevance” and “Property”]
19
Encoding - Web Ontology Language (OWL)
…………………..
<owl:Ontology rdf:about="">
<rdfs:comment
>Metadata Common Vocabulary (MCV) ontology.</rdfs:comment>
</owl:Ontology>
………………………
<!-- Object Properties -->
<!-- http://www.semanticweb.org/ontologies/2008/8/MCV.owl#uses -->
<owl:ObjectProperty rdf:about="#uses">
<owl:inverseOf rdf:resource="#isUsedBy"/>
</owl:ObjectProperty>
………………………..
<!-- Classes -->
<!-- http://www.semanticweb.org/ontologies/2008/8/MCV.owl#ComputerAssistedInterviewing -->
<owl:Class rdf:about="#ComputerAssistedInterviewing">
<rdfs:subClassOf rdf:resource="#DataCollection"/>
</owl:Class>
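The `owl:inverseOf` declaration means that every statement made with `uses` implies the mirrored statement with `isUsedBy`, and vice versa. A minimal sketch of that entailment, with hypothetical individuals (SurveyA, ClassificationX) and an illustrative function name:

```python
# Declared inverse pair, as in the OWL excerpt.
INVERSE_OF = {"uses": "isUsedBy"}


def with_inverses(assertions):
    """Add the statements implied by the declared inverse properties."""
    inverse = dict(INVERSE_OF)
    inverse.update({v: k for k, v in INVERSE_OF.items()})  # inverseOf is symmetric
    derived = set(assertions)
    for subject, prop, obj in assertions:
        if prop in inverse:
            derived.add((obj, inverse[prop], subject))
    return derived


# Hypothetical assertion about two individuals.
facts = {("SurveyA", "uses", "ClassificationX")}
for statement in sorted(with_inverses(facts)):
    print(statement)
```

An OWL reasoner derives such statements automatically, so only one direction needs to be recorded in the ontology.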
20
Conclusion
Since an ontology is a strict, rigorous and formal
way of representing knowledge, mapping a
glossary like the Metadata Common Vocabulary
into a Statistical Metadata Ontology can help to
reduce possible inconsistencies,
incompleteness and lack of structure.
This may facilitate the harmonization of concepts
describing data (semantic univocity) for
SDMX users.
21