Toward a Global Infrastructure for
Data and Metadata:
The Open Data Foundation
Arofan Gregory
Executive Manager
The Open Data Foundation
Something Really Amazing
• Spaceships aren’t that amazing…
• Aliens aren’t that amazing…
• Mobile telephones aren’t that amazing…
• Tasers aren’t that amazing…
• These devices have access to the complete set
of human (well, Federation) knowledge, via ship’s
computer - That’s AMAZING!
An Epic Feat of Data Standardization!
A Big Idea
• It might seem too outrageous to imagine that
every data source could be accessible and
usable via a global network, but…
– Consider all the domain “grids” which are emerging
– Consider the number of modern technologies for
leveraging data across networks
– Consider the tools we have for solving problems of
semantic interoperability
• Maybe Star Trek was only a few decades ahead
of its time!
Something Missing…
• Technology alone cannot solve this problem
• For centuries, scientists, librarians, and
archivists have worked to perfect taxonomies
and classifications for organizing and accessing
human knowledge
– Technologists can’t replace the disciplines which have
evolved from this work with technology alone
– They can only automate it
• Having an ontology doesn’t mean you have an
agreed, tried, and workable standard
classification system!
– A thousand little ontologies still produce chaos!
Why Now?
• The idea of a global data infrastructure is
practical today because…
– We have good, standards-based, networked
technology
– We have a highly sophisticated population of
archivists and librarians who understand the
challenges of large-scale classification, for all
types of media
– We have an emerging culture of data
producers and users who are beginning to
understand the potential offered by modern
technology
The Open Data Movement
From Wikipedia:
“Open Data is a philosophy and practice requiring
that certain data are freely available to
everyone, without restrictions from copyright,
patents or other mechanisms of control. It has a
similar ethos to a number of other "Open"
movements and communities such as Open
Source and Open Access.”
The Open Data Foundation (ODaF)
• Although we respect this traditional goal of the
Open Data movement, we feel that the
technology issues, as opposed to the legal ones,
have a different focus:
– Much public data is inaccessible or unusable
– Confidential data is less accessible than it could be
– The collection and publication of some critical data is
lacking, notably in the Developing World
• It is not enough to put the rights to data into the
public domain – it must also be practically
accessible to all potential users
What Do We Mean by “Data”?
• Official statistics collected by government
agencies and international organizations
– Usually aggregates and time-series data
– Covers a huge range of social, scientific, and
economic topics
• Numeric research data supporting social
sciences and hard sciences
– Often lower-level “microdata”
– May be gathered by survey or sourced from registers
• Qualitative data used in social sciences research
– Not research papers, but source data (e.g., interviews)
ODaF’s Mission
• To bring together individuals from the statistics
community, the research community, and the
technology standards community
• To promote the creation of a global infrastructure
for data and metadata by providing open-source
tools and supporting the adoption of a
coordinated set of open technology standards
• To promote the creation and use of knowledge,
and fact-based decision-making, through
improved access to data and metadata
ODaF - Timeline
• The idea started at IASSIST 2006 in Edinburgh
• Incorporated in 2006 as a US scientific non-profit
• First face-to-face meeting in Washington DC in
December 2006 at the National Opinion Research
Center (NORC)
• September 2007: next face-to-face meeting in St.
Helena, California
• Next face-to-face meeting: NORC in DC, December
2007, followed by a European meeting (UK,
Netherlands, or Germany) in early 2008
• NOTE: We are a virtual organization – we don’t rely on
face-to-face meetings for conducting work (Thanks,
Skype!)
ODaF - Directors
– Bob Glushko – head of the UC Berkeley
Center for Document Engineering and
member of OASIS Board of Directors
– Julia Lane – Vice President, NORC and
world-class expert in data confidentiality
issues
– Ernie Boyko, former President of IASSIST
– Rune Gloersen – head of IT at Statistics
Norway
ODaF - Executive Managers
• Arofan Gregory – background in SGML/XML, technology
standards (notably ebXML, UBL, UN/CEFACT, ISO
TC154, DDI, and SDMX)
• Pascal Heus - lead developer for World Bank and
International Household Survey Network, much
experience with field-work in Africa, DDI implementor
• Chris Nelson – veteran OMGer (CWM), worked with
many technology standards (UN/EDIFACT, GESMES,
ebXML, SDMX, DDI), consummate UML modeler
• Jostein Ryssevik – former CEO of Nesstar North
America, now with Ideas2Evidence, associated with
Gallup Europe; longtime DDI implementor
ODaF - Advisors
• Sandra Cannon - Board of Governors of the
Federal Reserve System
• Gilles Collette - Visual Communications, Pan American
Health Organization (WHO)
• Daniel Gillman - US Bureau of Labor Statistics
• Eduardo Gutentag – Chair, OASIS Board of
Directors
• Paul Johanis - Statistics Canada
• Graeme Oakley - Australian Bureau of Statistics
• Dr. Andrew Nelson - Joint Research Centre of
the European Commission
ODaF – Advisors (cont.)
• Ken Miller - UK Data Archive / Economic and Social Data
Service
• Duane Nickull - Chair, OASIS SOA Reference
Architecture TC
• Juraj Riecan - United Nations Economic Commission for
Europe (UNECE)
• Gerard Salou - European Central Bank
• Professor Bo Sundgren, Ph.D. - Statistics Sweden
• Wendy Thomas - Minnesota Population Center,
University of Minnesota
• Wendy Watkins - Data Centre Coordinator, Maps, Data
and Government Information Centre, Carleton University
Library
ODaF - Organization
• We are project-oriented:
– Any member can participate in projects
• May be paid consultants for specific work, or volunteers
– Project proposal is put before Directors by
Management team in consultation with Board of
Advisors for approval
– Work is conducted by specified project team, using
specified resources
– All Directors, Managers, and Advisors are volunteers
• Work is focused on coordination of projects, with
resources coming from other participating
organizations
The Problem Space
• The flows of data can be seen as forming a type
of “supply chain”
– Collected data are aggregated and
reported/disseminated to other organizations
– The points where data are exchanged can be
problematic:
• Loss of metadata
• No automated integration into receiving systems
• Time- and resource-intensive
• This exchange of data and metadata must be
managed in an efficient, standard fashion if we
are to build a global infrastructure
[Diagram: the data “supply chain”. Labels: Individuals, Households,
Banks, Corporates (transactions, accounts); National Statistical
Organisations, 180+ countries (accounts, statistics); Regional
Organisations (statistics); International Organisations (accounts);
web sites www.x.org, www.y.org, www.z.org, www.hub.org; Internet,
Search, Navigation]
Data Lifecycle Model
• Within each level of the information chain, we see a
process:
– Data sourcing or collection
– Data processing (re-coding, harmonization, aggregation)
– Data dissemination and archiving
– Data reporting and re-purposing
• Throughout this cycle, each step generates important
metadata which can be captured to provide better
downstream processing and understanding of the data
• Today, this metadata is often lost
– Between steps of the lifecycle
– When the final data product is exchanged in the information
chain
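The idea of capturing metadata at each lifecycle step can be sketched in a few lines of Python. This is only an illustration, not an ODaF tool: the `Dataset` class, the `apply_step` helper, the step names, and the sample values are all hypothetical. It shows one way each step could append its own record of what was done, so that the history travels with the data instead of being lost between steps.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical container: the data plus the metadata accumulated so far."""
    records: list
    metadata: list = field(default_factory=list)


def apply_step(dataset, step_name, transform, note):
    """Run one lifecycle step and record what was done alongside the result."""
    new_records = transform(dataset.records)
    entry = {
        "step": step_name,
        "note": note,
        "rows_in": len(dataset.records),
        "rows_out": len(new_records),
    }
    # Build a new metadata list so earlier entries are never overwritten.
    return Dataset(new_records, dataset.metadata + [entry])


def aggregate(records):
    """Mean income per region, sorted by region name."""
    by_region = {}
    for region, income in records:
        by_region.setdefault(region, []).append(income)
    return [(region, sum(v) / len(v)) for region, v in sorted(by_region.items())]


# Invented survey responses: (region, household income).
raw = Dataset([("north", 100), ("north", 300), ("south", 200)])
collected = apply_step(raw, "collection", lambda r: list(r),
                       "household survey, invented sample values")
aggregated = apply_step(collected, "processing", aggregate,
                        "mean income by region")

for entry in aggregated.metadata:
    print(entry["step"], "-", entry["note"])
```

The aggregated data set still carries the note from the collection step, which is the information a downstream user would otherwise have to request separately.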
An Observation on Organizations
• Governmental, supra-governmental, and research
organizations which produce data have as a primary
mission the collection of data
– To support policy making
– To support research
– To support regulatory activities
• They do not have a primary mission to focus on the
exchange of data with other organizations
– This is often perceived as a burden rather than a part of the
primary mission of the organization
• They often lack skills in the latest technology for
data exchange and interoperability
• Standards organizations tend to be too busy promoting
their own standards to be worried about how users might
combine them with other standards in implementations
Issues
• Issues with public data:
– Public data which is not released: “Users won’t
understand it” - Too little metadata!
– Public data which is unusable: formats are bad, too
little metadata about formats, terminology,
methodology, coding, and concepts
– Public data which cannot be accessed because its
location/existence is not known
– Public data which loses value because it cannot be
published and accessed in a timely manner
Issues (cont.)
• Issues with confidential data:
– Public data sets derived from confidential data have
been damaged by anonymization
– Confidential data which are not seen because access
produces unacceptable disclosure risk
• There are secure “Research Data Centers” for
allowing access to confidential data to qualified
researchers
– These are not as accessible or as open as they could
be, due to their physical nature and the fact that they
generally are not in communication with each other
– Better metadata management and shared metadata
lead to a better understanding of disclosure risk, and
thus improved access for researchers
Note on Data Confidentiality
• You might think proponents of Open Data would
disapprove of confidential data, but consider:
– Response rates are falling for all types of survey data
collection due to fears of disclosure
– There are many new ways of collecting data about
individuals (RFID chips, security cameras, cell
phones, etc.)
– The standards for data confidentiality are there for a
good reason – to protect individuals!
• We believe that confidential data should be as
open as possible, but no more so!
Issues (cont.)
• Issues with data in the Developing World:
– Absent data due to inefficient or nonexistent data
collection/publication
– Unsustainable data collection/publication produces
insufficient continuity of data
• Once educated, IT workers get jobs in Europe and America
• Funding is typically not on-going, but only for a limited period
• The vast majority of the world’s population is in
the Developing World, and that share is
growing
– To understand our world and make good policy, we
must support sustainable data collection and
publication about this huge segment of the
population!
How Can We Solve These
Problems?
• Many of these issues can be solved with modern
technology
– Better documentation using standard metadata
formats
– Better mechanisms for data discovery and access
between organizations of all types
– Better mechanisms for managing semantic
interoperability
– Free or inexpensive tools for metadata capture and
data/metadata exchange
– Improved mechanisms for sustainable collection and
publication of data in the Developing World
ODaF’s Vision
• A network of standard, federated registries provides the
ability to discover data and metadata globally
• Standard data and metadata formats and models
provide the basis for automated use and integration
between applications
• Standard semantic registries and mappings to standard
classifications/ontologies allow for semantic
interoperability
• All of these standards would be coordinated to work
together predictably in an open architecture
• Domains are self-governing – each has its own
registries, classifications, etc. There must be minimum
governance at the center for operation of the entire
network.
– Interoperability through mapping to the standards-based open
architecture
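A rough sketch of federated discovery, under stated assumptions: each domain keeps its own registry, and all registries tag entries with codes from one shared classification. The `Entry` and `Registry` classes, the subject codes, and the URL paths are hypothetical placeholders; real registries would be network services, not in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical registry record: what a data set is and where it lives."""
    title: str
    url: str
    subject: str  # code from a shared standard classification (assumed)


class Registry:
    """One self-governing domain registry (in-memory stand-in for a service)."""

    def __init__(self, domain, entries):
        self.domain = domain
        self._entries = list(entries)

    def query(self, subject):
        return [e for e in self._entries if e.subject == subject]


def federated_query(registries, subject):
    """Ask every registry the same question and group the answers by domain."""
    results = {}
    for registry in registries:
        hits = registry.query(subject)
        if hits:
            results[registry.domain] = hits
    return results


# Placeholder URLs and invented subject codes, for illustration only.
statistics = Registry("official-statistics", [
    Entry("Labour force survey aggregates", "http://www.x.org/lfs", "EMPLOYMENT"),
    Entry("Consumer price index", "http://www.x.org/cpi", "PRICES"),
])
research = Registry("social-science-archive", [
    Entry("Employment microdata, wave 1", "http://www.y.org/emp1", "EMPLOYMENT"),
])

found = federated_query([statistics, research], "EMPLOYMENT")
for domain, hits in found.items():
    print(domain, [e.title for e in hits])
```

The only thing the center has to govern in this sketch is the query interface and the shared subject codes; each registry decides for itself what it holds.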
Which Standards?
• ISO 17369 Statistical Data and Metadata
Exchange (SDMX)
• Data Documentation Initiative (DDI)
• ISO/IEC 11179 Metadata Registries
• ISO 19115 Digital Geographic Data
• Metadata Encoding and Transmission Standard
(METS)
• Extensible Business Reporting Language
(XBRL)
• Many others (SOA, ebXML, Web Services,
Semantic Web, Dublin Core)
ISO 17369 SDMX
• Produced by official statistics organizations (BIS, ECB,
Eurostat, IMF, OECD, World Bank, UN/SD)
• Now available as a 2.0 version
– Supports all aggregate data & time-series
– Supports all types of metadata (structural & “reference”
metadata)
– Provides standard registry interfaces for data sourcing and
exchange (not specific to SDMX formats)
• Based on a formal meta-model (similar to OMG’s
Common Warehouse Metamodel, but more focused)
• Data and metadata formats and classifications are
completely configurable
• Also provides recommendations for concepts, codes,
and classifications for official statistics
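To illustrate what “completely configurable” structural metadata can mean in practice, here is a minimal sketch. It does not use SDMX-ML syntax or any SDMX library: the `STRUCTURE` dictionary, the dimension names, and the codes are assumptions invented for the example. The point is that a structure definition (dimensions, each bound to a code list) lets software check and label a time series without hard-coded knowledge of the data.

```python
# Hypothetical structure definition: ordered dimensions, each with a code list.
STRUCTURE = {
    "FREQ": {"A": "Annual", "Q": "Quarterly", "M": "Monthly"},
    "REF_AREA": {"NO": "Norway", "CA": "Canada"},
    "INDICATOR": {"CPI": "Consumer price index"},
}


def validate_key(structure, key):
    """Return a list of problems with a series key; an empty list means valid."""
    problems = []
    for dimension, codes in structure.items():
        if dimension not in key:
            problems.append(f"missing dimension {dimension}")
        elif key[dimension] not in codes:
            problems.append(f"unknown code {key[dimension]!r} for {dimension}")
    for dimension in key:
        if dimension not in structure:
            problems.append(f"unexpected dimension {dimension}")
    return problems


def describe(structure, key):
    """Turn a coded key into human-readable labels using the code lists."""
    return ", ".join(structure[d][key[d]] for d in structure)


series_key = {"FREQ": "A", "REF_AREA": "NO", "INDICATOR": "CPI"}
bad_key = {"FREQ": "W", "REF_AREA": "NO"}

print(validate_key(STRUCTURE, series_key))
print(describe(STRUCTURE, series_key))
print(validate_key(STRUCTURE, bad_key))
```

Swapping in a different `STRUCTURE` changes which keys are valid without touching the functions, which is the sense in which the format is driven by metadata rather than by code.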
Data Documentation Initiative (DDI)
• Produced by a consortium of members (data archives
and libraries, national statistical organizations,
universities, etc.)
• Now in 3.0 candidate version which supports full data
lifecycle (release Q1 2008)
• Fine-grained metadata for describing:
– Data collection (surveys, registers, etc.)
– Data processing (for recodes, harmonization, data comparison)
– Data archiving and dissemination
– Data can be stored inline or in native file formats
– Supports microdata and n-dimensional cubes
• Aligned with SDMX, ISO/IEC 11179, METS, ISO 19115,
and Dublin Core
ISO/IEC 11179 Metadata Registries
• Model for managing semantics of a data
dictionary and the lifecycle of
concepts/terms
• There is a separate ISO specification
under development for providing bindings
in XML, C, and other languages
• In widespread use in many other
standards, as well as for terminology
management within large organizations
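A simplified sketch of the registry model may help. In ISO/IEC 11179 terms, a data element pairs a data element concept (an object class plus a property) with a value domain, and registered items move through administrative statuses over time. The Python classes below are an illustration only, not one of the standard's bindings; the field names are assumptions, and the status list is a shortened, illustrative lifecycle rather than the standard's complete set.

```python
from dataclasses import dataclass, replace

# Illustrative lifecycle order; not the standard's complete status list.
STATUSES = ["Candidate", "Recorded", "Standard", "Retired"]


@dataclass(frozen=True)
class DataElementConcept:
    object_class: str    # the thing described, e.g. "Person"
    property_name: str   # the characteristic, e.g. "Marital status"


@dataclass(frozen=True)
class ValueDomain:
    name: str
    permitted_values: tuple  # (code, meaning) pairs


@dataclass(frozen=True)
class DataElement:
    concept: DataElementConcept
    value_domain: ValueDomain
    status: str = "Candidate"

    def accepts(self, code):
        """True if the code is one of the value domain's permitted values."""
        return any(code == c for c, _ in self.value_domain.permitted_values)


def promote(element):
    """Move a data element one step forward in the illustrative lifecycle."""
    position = STATUSES.index(element.status)
    if position == len(STATUSES) - 1:
        raise ValueError("already at the final status")
    return replace(element, status=STATUSES[position + 1])


marital = DataElement(
    DataElementConcept("Person", "Marital status"),
    ValueDomain("Marital status codes", (("S", "Single"), ("M", "Married"))),
)
recorded = promote(marital)
print(marital.accepts("M"), marital.accepts("X"), recorded.status)
```

Separating the concept from the value domain is what allows two organizations to agree that they measure the same thing even when they code it differently.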
ISO 19115 Digital Geographic Data
• Provides the standard metadata model for
describing geographies
• Implemented in several XML standards,
including DDI (there is also a standard ISO
XML)
• Well-accepted within the technology
community and among communities of
use (geographers, etc.)
METS
• A packaging standard for digital
libraries/archives
– Pulls together associated sets of files and establishes
their relationship to one another
• Can carry metadata payloads in their native XML
namespaces as “metadata sections”
• Cooperatively developed with DDI
– METS left the description of data to DDI
– DDI supports METS for archival packaging
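The packaging idea can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names follow METS (`dmdSec`, `mdWrap`, `xmlData`, `fileSec`, `structMap`), but the output is a simplified envelope that has not been validated against the METS schema. The `build_package` function, the file names, and the `urn:example:ddi` payload are hypothetical stand-ins, not real DDI.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS}}}{tag}"


def build_package(payload_xml, files):
    """Wrap a metadata fragment and (id, href) file pairs in a METS-style envelope."""
    root = ET.Element(q("mets"))

    # Metadata section: the payload keeps its own native namespace.
    dmd = ET.SubElement(root, q("dmdSec"), ID="DMD1")
    wrap = ET.SubElement(dmd, q("mdWrap"), MDTYPE="DDI")
    ET.SubElement(wrap, q("xmlData")).append(ET.fromstring(payload_xml))

    # File section: one entry per packaged file.
    file_grp = ET.SubElement(ET.SubElement(root, q("fileSec")), q("fileGrp"))
    for file_id, href in files:
        file_el = ET.SubElement(file_grp, q("file"), ID=file_id)
        ET.SubElement(file_el, q("FLocat"),
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": href})

    # Structural map: ties the files to one another and to the metadata.
    div = ET.SubElement(ET.SubElement(root, q("structMap")), q("div"), DMDID="DMD1")
    for file_id, _ in files:
        ET.SubElement(div, q("fptr"), FILEID=file_id)
    return root


# Placeholder payload and file names, invented for the example.
payload = '<study xmlns="urn:example:ddi"><title>Example survey</title></study>'
package = build_package(payload, [("F1", "survey.csv"), ("F2", "codebook.pdf")])
print(ET.tostring(package, encoding="unicode"))
```

The payload is carried unchanged inside the envelope, which reflects the division of labor described above: the package relates the files, and the data description stays in its own vocabulary.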
XBRL
• XML standard from the accounting world
for describing business reports
• Widely used by banking supervisory
organizations
– Major source of financial statistics
• Well marketed and widely supported
• Ongoing alignment project with SDMX
ODaF Vision - Standards
[Diagram labels: Federated Registries (based on SDMX, ebXML, web
services); ISO 11179 Semantic definitions; Standard classifications;
Aggregated Data/Metadata (SDMX); XBRL Business Reports; DDI Microdata
Sets; ISO 19115 Geographies; Dublin Core Citations; METS Packaging.
Connector labels: “Organized using registered”, “References to source
data”, “Used in”]
ODaF Activities
• We are early in our efforts to create such
an infrastructure
– To establish a sufficient set of well-aligned
standards
– To build open-source tools to support the use
of these standards
– To otherwise support the adoption and use of
standard models, formats, and registries
ODaF Projects
• Standards Alignment Project: on-going effort to
establish an agreed mapping between the mentioned
standards
• SDMX Registry Hosting: Host SDMX registries on our
own servers for those wishing to do prototype
implementations
• DDI Development Support: provide hosting and
infrastructure to support the use and development of DDI
3.0
• DDI Foundation Tools Program: providing technical
coordination and infrastructure for a multi-institution effort
to build an Eclipse-based open-source toolkit for working
with DDI 3.0, including transforms to/from SAS, SPSS,
and STATA
• SDMX Browser: Developing an open-source tool (using
Adobe Flex) for collecting, updating, and viewing statistical
data in SDMX format – working in informal collaboration
with ECB and OECD
ODaF Projects (cont.)
• DeXtris Browser: beta end-user tool for viewing and
searching DDI 1/2.* and 3.0 metadata files – supports
version transformations
• UKDA QuDEX Draft Standard: Working as technical
support for UKDA in their development of a standard for
qualitative metadata (may become part of DDI)
• Canadian RDC Network: Providing technical advice to
the Canadian RDC network on metadata management
and implementation in support of DDI 3.0.
• NORC Virtual Data Enclave: Working to help develop
and deploy the first “virtual” RDC in the US with data
from NIST and others
• Also involved in proposals to build a European “virtual”
RDC
ODaF Projects (cont.)
• Have contributed to the creation of training
materials and online support for DDI 3.0, for
general use
• White papers: DDI & SDMX (a comparison),
guidelines for open-source tools development,
others
• Member, DDI Alliance
• Sponsored IASSIST 2007 in Montreal (planned
also for IASSIST 2008 in Palo Alto, CA)
ODaF - Where We Are Today
• New organization, lots of interest and
support thus far
• Interesting projects are emerging, some
early deliverables have been finished
• Looking for participation from interested,
serious individuals
• Still at the stage of supporting and
promoting a coordinated set of standards
To Learn More…
• ODaF: www.opendatafoundation.org
• SDMX: www.sdmx.org
• DDI: www.ddialliance.org
• ISO/IEC 11179: http://metadata-stds.org/11179/
• METS: http://www.loc.gov/standards/mets/
• ISO 19115: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=26020
• XBRL: http://www.xbrl.org/Home/
Tools and Training
• For some free SDMX tools, an
implementation support site, and SDMX
and DDI training courses:
www.metadatatechnology.com
Questions?