OPeNDAP: Accessing Data in a Distributed, Heterogeneous Environment Peter Cornillon Graduate School of Oceanography University of Rhode Island Presented at the NSF Sponsored Cyberinfrastructure Meeting 31 October 2002

OPeNDAP: Accessing Data in a
Distributed, Heterogeneous
Environment
Peter Cornillon
Graduate School of Oceanography
University of Rhode Island
Presented at the NSF Sponsored
Cyberinfrastructure Meeting
31 October 2002
Outline
 DODS NVODS & OPeNDAP
 Interoperability: The Core Infrastructure
 How OPeNDAP is being used
 Lessons learned – also throughout
Distributed Oceanographic Data System
(DODS)
Conceived in 1992 at a workshop held at URI.
 Objectives were:
– to facilitate access to PI-held data as well as data
held in national archives, and
– to allow the data user to analyze data using the
application package with which he or she is
most familiar.
 Basic system designed and implemented in
1993-1994 by Gallagher and Flierl
Distributed Oceanographic Data System
DODS consisted of two fundamental parts:
 a discipline independent core infrastructure
for moving data on the net,
 a discipline specific portion related to data –
population, location, specialized clients, etc.
DODS → OPeNDAP & NVODS
To isolate the discipline independent part of the
system from the discipline specific part, two
entities have been formed:
 Open Source Project for a Network Data
Access Protocol (OPeNDAP)
 National Virtual Ocean Data System
(NVODS)
DODS → NVODS/OPeNDAP
 OPeNDAP was formed to maintain and
evolve the DODS core infrastructure
 OPeNDAP is a non-profit corporation
 OPeNDAP focuses on the discipline neutral
parts of the DODS data access protocol
Objective of OPeNDAP
 To provide a data access protocol allowing
for machine-to-machine interoperability
with semantic meaning in a distributed,
heterogeneous data environment
The scripted exchange of data between
computers, without human intervention.
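A scripted exchange of this kind can be sketched in a few lines. The following is a minimal illustration, assuming a DAP2-style server that appends a response suffix (`dds`, `das`, `dods`, `ascii`) to the dataset URL and accepts an index-range constraint after `?`. The server address, the file name `sst.nc`, the variable name `sst`, and the helper `dap_request_url` are hypothetical and are not taken from the presentation.

```python
def dap_request_url(dataset_url, response, variable=None, index_ranges=()):
    """Build a DAP2-style request URL (illustrative helper, not a library API).

    dataset_url  -- URL of the data set on the server (hypothetical here)
    response     -- response suffix: "dds", "das", "dods" or "ascii"
    variable     -- optional variable name to subset
    index_ranges -- (start, stop) inclusive index pairs, one per dimension
    """
    url = f"{dataset_url}.{response}"
    if variable is not None:
        hyperslab = "".join(f"[{start}:{stop}]" for start, stop in index_ranges)
        url += f"?{variable}{hyperslab}"
    return url


# Hypothetical data set URL, used only to show the shape of a request.
BASE = "http://example.org/dods/sst.nc"

# Ask for the structure of the data set, then for a subset of one variable.
print(dap_request_url(BASE, "dds"))
print(dap_request_url(BASE, "dods", "sst", [(0, 0), (10, 19), (0, 39)]))
```

Because the whole request is a URL, a script can build and issue it without a person filling in a form, which is the sense of "without human intervention" above.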
Considerations with regard to the
development of OPeNDAP
 Many data providers
 Many data formats
 Many different client types
 Many different semantic representations of
the data
The Core Infrastructure
Interoperability
Interoperability - Metadata
The degree to which machine-to-machine
interoperability is achieved depends on the
metadata associated with the data.
OPeNDAP and Metadata
Metadata Types
We define two classes of metadata:
• Search metadata – used to locate data sets
of interest in a distributed data system.
• Use metadata – needed to actually use
the data.
Use Metadata
We divide use metadata into two classes:
• Syntactic use metadata
• Semantic use metadata
Syntactic Use Metadata
Information about the data types and structures
at the computer level - the syntax of the data;
– e.g., variable T represents a 20x40 element floating
point array.
Semantic Use Metadata
Information about the contents of the data set.
e.g., variable T represents
• sea surface temperature
• with units of ºC
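The distinction between the two kinds of use metadata can be made concrete with a small sketch. The class and field names below are hypothetical and only mirror the slide examples: the syntactic record describes what the computer needs to read variable T, and the semantic record describes what T means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntacticMetadata:
    """Data types and structures at the computer level."""
    name: str
    dtype: str
    shape: tuple

    def element_count(self):
        # Number of values a client must be prepared to receive.
        count = 1
        for extent in self.shape:
            count *= extent
        return count


@dataclass(frozen=True)
class SemanticMetadata:
    """Information about the contents of the variable."""
    quantity: str
    units: str


# Variable T from the slides: a 20x40 floating point array ...
t_syntax = SyntacticMetadata(name="T", dtype="float", shape=(20, 40))
# ... that represents sea surface temperature in degrees Celsius.
t_meaning = SemanticMetadata(quantity="sea surface temperature", units="ºC")

print(f"{t_syntax.name}: {t_syntax.element_count()} {t_syntax.dtype} values")
print(f"{t_syntax.name}: {t_meaning.quantity} [{t_meaning.units}]")
```

A client can read and move T with the syntactic record alone; it needs the semantic record before it can label or interpret the values.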
Semantic Use Metadata
We divide semantic use metadata into two classes:
• Translational Semantic Use Metadata
• Descriptive Semantic Use Metadata
Translational Semantic Use Metadata
 Metadata required to make use of the data;
e.g., to properly label a plot of the data
 Define the translation from received values to
semantically meaningful values
 Examples
• Variable names in the data set: t → SST
• Units of the data: 0.125 C+4 → C
• Missing value flags: -99 → missing value
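The three examples above can be sketched as one translation rule applied on the client side. This is an illustration under stated assumptions: the units example is read here as "physical value = 0.125 × stored value + 4", and the class `Translation` and its field names are hypothetical rather than part of OPeNDAP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    """Translational semantic use metadata for one variable (illustrative)."""
    source_name: str     # name used in the data set, e.g. "t"
    target_name: str     # semantically meaningful name, e.g. "SST"
    scale: float         # assumed reading of "0.125 C+4": multiply by 0.125 ...
    offset: float        # ... then add 4
    missing_flag: float  # received value that marks a missing observation
    units: str           # units of the translated values

    def apply(self, received_values):
        """Map received values to meaningful values; missing becomes None."""
        return [
            None if value == self.missing_flag
            else self.scale * value + self.offset
            for value in received_values
        ]


sst_rule = Translation(
    source_name="t", target_name="SST",
    scale=0.125, offset=4.0, missing_flag=-99, units="C",
)

received = [0, 8, -99, 160]  # made-up values for the demonstration
print(sst_rule.target_name, sst_rule.apply(received), sst_rule.units)
```

Keeping the rule separate from the data means the received values themselves are never edited; only the client's view of them changes.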
OPeNDAP and Metadata
Interoperability – Data Exchange
Interoperability may be defined at any one of
a number of levels ranging from:
 the lowest (hardware) - how computers are
linked electronically, to
 the highest – semantically meaningful,
machine-to-machine exchanges.
Organizational Complexity
Example: Consider the different ways of organizing a
multi-year data set consisting of one global sea
surface temperature (SST) field per day:
 one 2-d file per day sst(lat,lon) - URI
 one 3-d file sst(lon,lat,time) - PMEL
 one file per year with one variable per day: 365
variables per file, n files for n years - GSFC
Structure Layer
 Provide the capability to reorganize data so
that they are in a consistent structural form.
 Objective is to reduce the granularity of the
data set
 Example: one 3-d file sst(lon,lat,time)
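A minimal sketch of what such a structural reorganization involves, using plain nested lists in place of real files. The function names, the tiny 1x2 grids, and the dictionary layouts are hypothetical. The result is indexed as sst[time][lat][lon] for convenience, which differs from the sst(lon,lat,time) order on the slide.

```python
def aggregate_daily_files(daily_grids):
    """One 2-d grid per day ({date: grid}) -> one 3-d sst[time][lat][lon]."""
    return [grid for _, grid in sorted(daily_grids.items())]


def aggregate_yearly_variables(yearly_files):
    """One file per year, one variable per day ({year: {day: grid}})
    -> the same 3-d sst[time][lat][lon] structure."""
    return [
        grid
        for _, days in sorted(yearly_files.items())
        for _, grid in sorted(days.items())
    ]


# The same two days of data, organized in two of the ways listed above.
per_day = {
    "2002-01-01": [[10.0, 11.0]],
    "2002-01-02": [[12.0, 13.0]],
}
per_year = {
    2002: {1: [[10.0, 11.0]], 2: [[12.0, 13.0]]},
}

sst_a = aggregate_daily_files(per_day)
sst_b = aggregate_yearly_variables(per_year)

# Both organizations collapse to one consistent structure; values are untouched.
print(sst_a == sst_b, len(sst_a), "time steps")
```

After aggregation the client sees one data set rather than many files or variables, which is the reduction in granularity described above.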
Format Layer
 Format transformation only between server
and client
 Data values are not modified
 The organizational structure of the data is
not modified
Structure Layer
 The organizational structure of the data is
modified
 Data values are not modified
An OPeNDAP Structural Layer
Component – The Aggregation Server
 Developed by John Caron of Unidata
 Is for the aggregation of grids and arrays only
 Operates at the syntactic structural level
OPeNDAP - NVODS
Status
[Figure: OPeNDAP server sites and OPeNDAP/NVODS server sites]
OPeNDAP Client and Server Status
Special Servers
Projects Using OPeNDAP
 GODAE (Global Ocean Data Assimilation Experiment)
 NOMADS (NOAA Operational Model Archive and
Distribution System)
 AOIMPS
 ESG II - Earth System Grid II
 Ocean.US (US-GOOS)
 High Altitude Observatory Community
Institutions Making Heavy Use
of OPeNDAP2
 Ingrid - Columbia University
 COLA - Center for Ocean-Land-Atmosphere Studies
 Goddard DAAC
 CDC - Climate Diagnostics Center
 PMEL - Pacific Marine Environmental Laboratory
OPeNDAP Monthly Accesses (2002)
Site/Month   April      May        June       July       August
URI          4,856      19,504     3,691      26,693     7,440
LDEO         80,709     62,930     46,092     93,088     32,084
CDC          102,518    153,362    62,395     181,974    107,512
JPL          3,068      34,028     63,309     8,260      13,282
COLA         347,506    412,991    337,310    400,314    638,376
TOTAL        535,589    648,787    502,797    702,069    785,412
OPeNDAP Unique Users
(2002)
Site   April   May   June   July   August
URI    73      68    72     44     69
CDC    124     105   91     116    111
JPL    122     152   173    198    174
COLA   158     199   317    197    408
Interesting OPeNDAP Access Statistics
• IRI data accesses for 1st quarter of 2002
Type      Requests    %       Volume (GB)   %
OPeNDAP   191,611     8.5     375.2         69.4
Other     2,062,681   91.5    165.2         30.6
Total     2,254,292   100.0   540.4         100.0
• PMEL OPeNDAP2 ~35,000 with ~26,000
internal.
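The IRI figures above show OPeNDAP carrying a small share of requests but most of the volume. A short worked calculation of the average volume per request makes the contrast explicit. The conversion of 1 GB to 1000 MB is an assumption, and the function name is hypothetical.

```python
# IRI data accesses, 1st quarter of 2002: (requests, volume in GB).
IRI_Q1_2002 = {
    "OPeNDAP": (191_611, 375.2),
    "Other": (2_062_681, 165.2),
}


def mb_per_request(requests, volume_gb):
    """Average transfer size per request, assuming 1 GB = 1000 MB."""
    return volume_gb * 1000.0 / requests


for access_type, (requests, volume_gb) in IRI_Q1_2002.items():
    size = mb_per_request(requests, volume_gb)
    print(f"{access_type}: {size:.2f} MB per request")
```

Under that assumption the average OPeNDAP request moves roughly 2 MB, against well under 0.1 MB for the other access types.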
Lessons (Re)Learned
Lessons (Re)Learned
1. Modularity provides for flexibility
The more modular the underlying
infrastructure, the more flexible the system.
This is particularly important for network-based
systems, for which the technology,
both software and hardware, is changing rapidly.
Lessons (Re)Learned
2. Data of interest will be stored in a variety of
formats.
Regardless of how much one might want to
define the format to be used by system
participants, in the end the data will be stored
in a variety of formats.
2a. The same is true of translational use
metadata!
Lessons Learned
3. Structural representation of sequence data
sets is a major obstacle to interoperability
Care must be taken with the organizational
structure (as opposed to the format) of the data.
This is the single largest constraint on the use of
profile data in NVODS.
Lessons (Re)Learned
4. “Not invented here”
Avoid the “not invented here” trap. The basic
concepts of a data system are relatively
straightforward to define. Implementing these
concepts ALWAYS involves substantially more
work than originally anticipated. The “Devil’s in
the details”.
Take advantage of existing software wherever
possible.
Lessons (Re)Learned
5. Work with those who adopt the system for
their own needs.
Take advantage of those who are interested in
contributing to the system because the system
addresses their needs as opposed to those who
are simply doing the work for the associated
funding. => Open source.
Lessons Learned
6. There is no well-defined funding structure
for community-based operational systems.
It is much easier to obtain funding to develop
a system than it is to obtain funding to
maintain and evolve a system.
This is a major obstacle to development of a
stable cyberinfrastructure that meets the
needs of the research community.
Lessons Learned
7. It is relatively more difficult to obtain
funding for applied system development
than for research related to data systems.
This is another obstacle to the development
of cyberinfrastructure that meets the needs
of the research community.
Lessons (Re)Learned
8. “Tough to teach old dogs new tricks”
Introducing new technology often
requires a cultural change in usage that
is difficult to effect. This can negatively
affect system development.
Lesser Lessons Learned
9. Some surprises encountered in the NVODS/
OPeNDAP effort
 Heavy within-organization usage.
 Metadata focus in the past is appropriate for interoperability at
the data level.
 Number of variables increases almost linearly with the number
of data sets.
 Users will take advantage of all of the flexibility offered by a
system, sometimes to the disadvantage of all.
 Incredible variability in the structural organization of data.
Lessons Learned
10. Metrics suggest
 Increasing use of scripted requests
 Large volume transfers
As data systems offering machine-to-machine
interoperability with semantic
meaning take hold, we could well see an
explosive growth in the use of the web.
Lessons Learned
11. Time to maturity is on the order of 10 years, not 3
Developing new infrastructure takes time,
both to iron out all of the %^*% little details
and for the infrastructure to be adopted.
Peter’s Law
The more metadata required the
less data delivered
Of course, the less metadata, the
harder it is to use the data
http://unidata.ucar.edu/packages/dods
http://nvods.org