big data - Unidata
7 +/- 2
Maybe Good Ideas
John Caron
June 2011
(1)
• NetCDF-Java (aka CDM) has lots of functionality, but
only available in Java
– NcML Aggregation
– Access to lots of other file formats
– Feature types (eg collections of point data)
– Ironically, some functionality (eg aggregation) already
available for remote datasets through opendap
– But not for local datasets
• How can we get the CDM into other languages?
– Replicate in C and maintain two software stacks
– Use reverse JNI (call Java from C)
– Or …
CdmRemote Server (aka TDS Lite)
• Lightweight server for CDM datasets
– Zero configuration – use queries to configure
– Local filesystem
– Cache expensive objects
– Allow non-Java applications access to CDM stack
– Create virtual datasets: aggregations, logical views
– Coordinate space queries
– Feature Type subsetting
– New API (!)
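As a minimal sketch of "use queries to configure", a non-Java client might build request URLs along these lines. The server address, the dataset path, and the `req` / `var` parameter names are assumptions made for illustration; the slides do not specify the query syntax.

```python
from urllib.parse import urlencode

# Hypothetical server location; the slides do not name one.
BASE_URL = "http://localhost:8080/cdmremote"


def build_query(dataset_path, req, var=None):
    """Build a request URL for a local dataset behind a cdmRemote-style server.

    The parameter names `req` and `var` are assumptions, not a documented API.
    """
    params = {"req": req}
    if var is not None:
        params["var"] = var
    return f"{BASE_URL}/{dataset_path.lstrip('/')}?{urlencode(params)}"


# Ask for the dataset structure first, then for one variable's data.
header_url = build_query("data/ruc2.nc", "header")
data_url = build_query("data/ruc2.nc", "data", var="PrecipTotal")
```

Because all configuration travels in the query string, the server itself needs no per-dataset setup, and any language with an HTTP client can play the client role.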
CdmRemote Server (aka TDS Lite)
[Diagram. Labels: Python / ? client and C client, each with a CDM Point
Feature API, over cdmRemote; cdmRemote Server; Application; Coordinate
Systems; Data Access; Data.]
(2)
Ncstream as a netCDF file format
• Write-optimized
• Append only
• Encode the full CDM object model
• Uses Google’s protobuf for serialization
• Java, C Libraries can read and access through
the standard netCDF API
• Tools to convert to netcdf-3 and 4 formats
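The "append only" and "write-optimized" points can be illustrated with a length-prefixed message stream. This is only a sketch: it is not the actual ncstream wire format, which the slides do not describe. The payloads are opaque bytes standing in for protobuf-serialized messages, and the helper names are hypothetical. The length prefix uses the base-128 varint encoding that protobuf itself uses.

```python
import io


def encode_varint(n):
    """Encode a non-negative int as a protobuf-style base-128 varint."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(stream):
    """Read one varint; return None at a clean end of stream."""
    result, shift = 0, 0
    while True:
        b = stream.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        result |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            return result
        shift += 7


def append_message(stream, payload):
    """Append one length-prefixed message; earlier bytes are never rewritten."""
    stream.seek(0, io.SEEK_END)
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_messages(stream):
    """Yield payloads sequentially, in the order they were appended."""
    stream.seek(0)
    while True:
        length = read_varint(stream)
        if length is None:
            return
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("truncated message")
        yield payload


buf = io.BytesIO()
append_message(buf, b"header: dims, vars, attrs")  # stand-in for a header message
append_message(buf, b"x" * 300)                    # stand-in for a data message
messages = list(read_messages(buf))
```

A writer in this scheme never seeks backwards, which is what makes the format cheap to write; the cost is that a reader has to scan sequentially unless an index is built separately.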
(3) BUFR/GRIB Table registration
• Unidata sponsored web service
• Registered users can upload BUFR/GRIB tables
– Unique id is assigned (MD5 16 byte checksum?)
– Convince producers to include the id in the data, making it
unambiguous which table was used
– Anyone can download.
• GRIB and BUFR Decoding
– Using CDM – find bugs !
– Might become (ad-hoc) reference library
– Might spur objections from “the experts”
– Turn over to WMO if they want it
• Survival of Human Race is at stake here
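The unique-id idea from the registration bullets can be sketched as follows, taking the "MD5 16 byte checksum?" suggestion at face value. The in-memory `registry` dict and the function names are hypothetical stand-ins for the proposed web service, and the sample table text is illustrative only.

```python
import hashlib

# Hypothetical in-memory registry: 16-byte id -> table contents.
registry = {}


def table_id(table_bytes):
    """Return the 16-byte MD5 digest of the table contents as a candidate id."""
    return hashlib.md5(table_bytes).digest()


def register(table_bytes):
    """Store a table under its content-derived id and return that id."""
    tid = table_id(table_bytes)
    registry[tid] = table_bytes
    return tid


def lookup(tid):
    """Return the registered table for an id, or None if it is unknown."""
    return registry.get(tid)


# Illustrative table fragment, not a real registered table.
sample = b"0 01 001 WMO BLOCK NUMBER\n0 01 002 WMO STATION NUMBER\n"
tid = register(sample)
```

Because the id is derived from the contents, uploading the same table twice yields the same id, and a decoder that finds the id embedded in a message can fetch exactly the table the producer used.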
(4)
Streaming data / standing queries
• The proposal Dennis and I submitted last year
• “As soon as it arrives on IDD, send me
PrecipTotal from NCEP/RUC2 model subsetted
by lat/lon bounding box in netCDF-4 / CF
format”
• “As it arrives, send me GTS BUFR data in
lat/lon bounding box in CSV”
Current IDD data access
[Diagram. Labels: IDD Data; Push (header); LDM; FILE; Pull requests; TDS;
CDM library; Dataset (x4).]
Content based filtering (standing requests)
[Diagram. Labels: IDD Data; LDM; PIPE; Message Service (content filtering,
change encoding, protocol?); Push (content); Request (x3); Content Filter;
Standing request service.]
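A standing request such as "send me PrecipTotal from NCEP/RUC2 in a lat/lon bounding box" can be sketched as a stored predicate that is evaluated against each arriving message. Everything here is an assumption for illustration: the class names, the message fields, and the simplification that each message carries a single lat/lon point. The bounding box also ignores dateline crossing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    south: float
    north: float
    west: float
    east: float

    def contains(self, lat, lon):
        return self.south <= lat <= self.north and self.west <= lon <= self.east


@dataclass(frozen=True)
class StandingRequest:
    source: str      # e.g. "NCEP/RUC2"
    variable: str    # e.g. "PrecipTotal"
    bbox: BoundingBox
    encoding: str    # requested output format, e.g. "netCDF-4/CF" or "CSV"

    def matches(self, message):
        """True if an arriving message satisfies this request."""
        return (message["source"] == self.source
                and message["variable"] == self.variable
                and self.bbox.contains(message["lat"], message["lon"]))


def dispatch(message, requests):
    """Return the standing requests the arriving message should be pushed to."""
    return [r for r in requests if r.matches(message)]


requests = [
    StandingRequest("NCEP/RUC2", "PrecipTotal", BoundingBox(35, 45, -110, -100), "netCDF-4/CF"),
    StandingRequest("NCEP/RUC2", "PrecipTotal", BoundingBox(20, 30, -90, -80), "netCDF-4/CF"),
]
arriving = {"source": "NCEP/RUC2", "variable": "PrecipTotal", "lat": 40.0, "lon": -105.0}
matched = dispatch(arriving, requests)
```

The difference from the pull model is that the filter is registered once and run by the service on arrival, so the client never polls.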
(5) Python
• Unidata should choose a scripting language to
support, and give scientists full access to all of
our tools in it
• Python wants to be the open-source Matlab
• DOE, BADC have bought into Python
• Python is a safe choice
(6) NetCDF management tools
• Develop consistent set of tools for managing
collections of netCDF files
– Use existing tools (ncgen, nccopy, ncdump, nco,
etc) under the covers
– but don’t be constrained by their interfaces
• Look at RDBMS management languages
• Use a scripting language like Python
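One way to read "use existing tools under the covers, but don't be constrained by their interfaces" is a thin Python layer that plans commands for a whole collection of files. The sketch below only builds the command lines; the function names and the `.nc4` output suffix are assumptions. It relies on two existing options: `ncdump -h` (header only) and `nccopy -d n` (deflate level).

```python
import os
import shlex


def header_cmd(path):
    """Command that prints only the header (dimensions, variables, attributes)."""
    return ["ncdump", "-h", path]


def compress_cmd(src, dst, deflate_level=5):
    """Command that copies src to dst, compressing at the given deflate level."""
    if not 0 <= deflate_level <= 9:
        raise ValueError("deflate level must be in 0..9")
    return ["nccopy", "-d", str(deflate_level), src, dst]


def plan_compress(paths, suffix=".nc4"):
    """Build one nccopy command per file in a collection, without running any."""
    return [compress_cmd(p, os.path.splitext(p)[0] + suffix) for p in paths]


plan = plan_compress(["obs/2011-06-01.nc", "obs/2011-06-02.nc"])
for cmd in plan:
    print(shlex.join(cmd))  # review the plan before executing it
```

Each planned command could then be executed with `subprocess.run(cmd, check=True)` on a machine where the netCDF utilities are installed.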
(7) Hadoop
– Open Source started by Doug Cutting (Lucene) and Yahoo
– Based on Google’s Map-Reduce for parallel processing
– Lots of industry use, part of new data ecosystems
– Objects in distributed, replicated file system
– Commodity, shared-nothing hardware nodes
– Simple key-value store
– Append-only, sequential reading
– Scale to arbitrarily large amount of data (batch)
– Gather many queries and run them over the data
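The map-reduce model over key-value records can be sketched in a few lines of single-process Python. This is not Hadoop code and uses no Hadoop API; the station records are made up, and the function names are hypothetical. It shows the three phases and how several queries can share one pass over the data.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record, producing (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values to a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Made-up records: (station, precipitation in mm), read sequentially.
records = [("KDEN", 1.2), ("KBOS", 0.0), ("KDEN", 3.4), ("KBOS", 2.5), ("KDEN", 0.6)]


def mapper(record):
    station, precip = record
    yield station, precip


groups = shuffle(map_phase(records, mapper))

# Two "queries" answered from the same pass over the data.
max_precip = reduce_phase(groups, lambda key, values: max(values))
obs_count = reduce_phase(groups, lambda key, values: len(values))
```

In a real cluster the map and reduce phases run in parallel on separate nodes and the shuffle moves data between them; the structure of the computation is the same.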
(8) SciDB
• Michael Stonebraker, David DeWitt
– “SciDB will be optimized for data management of big data and for big
analytics.”
– “The scientists that are participating in our open source project
believe that the SciDB database — when completed — will
dramatically impact their ability to conduct their experiments faster
and more efficiently and further improve the quality of life on our
planet by enabling them to run experiments that were previously
impossible due to the limitations of existing database systems and
infrastructure.”
• Getting involved:
1. Load netcdf/hdf5 into SciDB
2. “Native mode” – leave data in netcdf/hdf5