Transcript Slide 1

Large-Scale Data Management Challenges
Federating
Climate, Water, and Weather Data
Repository/Workspace Workshop
20-21 September 2010
Kenneth Galluppi
Director, Disaster and Environmental Programs
Renaissance Computing Institute
University of North Carolina at Chapel Hill
Outline
• Environmental Problem
• Use Case
– Climate and Weather
– Hydrology
• Data Grid/Workspace Use Cases
• Answer Peter’s Questions
Environmental Science Problems
• Enable cutting edge, Grand Challenge
multidisciplinary science through the federation of
data-grids of climate, water, and weather data, with
other geospatially and socially relevant datasets.
– Understanding of regional impacts of climate change on
water availability and society trends
– Understanding and prediction of catastrophic weather-driven events under climate change
– Communicate risk/crisis knowledge to non-specialists
Challenges of Data
• Integration of Large, Multidisciplinary
Datasets
– NCDC and NOAA Centers, SDSC, and others
– Discover, access, integration, utility [not store/retrieve]
• Linkage of Datasets to Computational Models
– Input/outputs for real-time model forecasting
– Model-to-observation comparison
– Climatic models for reanalysis and prediction
• Access to Large Reference Data
– Climate Reanalysis Datasets, 1 PetaByte
– NWS DataCube for aviation and emergencies
Collaboration and Datagrids
Academic Research
Federal Agencies
National Climatic
Data Center
140 universities
Research Program
Emergency Management
Data Supports NOAA/NCDC Mission
National Climatic Data Center
NOAA Mission:
To understand and predict changes in Earth’s
environment and conserve and manage coastal and
marine resources to meet our nation’s economic,
social, and environmental needs
NOAA Goals:
Climate Understand Climate Variability and Change to Enhance Society’s Ability to Plan and Respond
Weather & Water Serve Society’s Needs for Weather and Water Information
Commerce & Transportation Support the Nation’s Commerce with Information for Safe,
Efficient, and Environmentally Sound Transportation
Ecosystems Protect, Restore, and Manage the Use of Coastal and Ocean Resources through an Ecosystem
Approach to Management
Mission Support Provide Critical Support for NOAA’s Mission
The National Environmental Data Archive
Climate
Analysis
RADAR
Satellite
Other
Comprehensive Large
Array-data Stewardship
System (CLASS)
Storage
(reanalysis)
NOAA CLASS
• Large Structured data
• Proprietary
• Doesn’t interface with
HPSS
• Climate Support of
products and services
• Does well what it does
NOAA’s Data Centers Will Function in a
Wider Information Landscape
NCDC
NGDC
NSOF
NODC
NSF
DataNet
ORNL,
ESG
DAPs
Data Mgmt
NEAAT
IPCC
International
Sources
Climate Services using Federated DB’s
NOAA’s Data Centers will need to provide access to petabytes of data that are
distributed across multiple NOAA facilities
Be able to integrate these data with data from other disciplines (environmental,
biological, social, etc.) that are distributed on other databases in both the public
and private sectors
Export data to common data formats - Shapefile, Well-Known Text, Arc/Info ASCII
GRID, Gridded and Raw NetCDF, GeoTIFF and KMZ (Google Earth)
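Two of the export formats listed above are plain text and simple enough to sketch directly. The following is a minimal illustration only, assuming a small in-memory grid and a single point location; the function names and sample values are hypothetical and are not part of any NOAA or NCDC system.

```python
from typing import Sequence


def to_wkt_point(lon: float, lat: float) -> str:
    """Format one location as a Well-Known Text POINT (x = lon, y = lat)."""
    return f"POINT ({lon} {lat})"


def to_ascii_grid(rows: Sequence[Sequence[float]], xll: float, yll: float,
                  cellsize: float, nodata: float = -9999) -> str:
    """Serialize a 2-D grid as Arc/Info ASCII GRID text.

    The first row in `rows` is the northernmost row; (xll, yll) is the
    lower-left corner of the grid.
    """
    ncols = len(rows[0])
    if any(len(r) != ncols for r in rows):
        raise ValueError("all rows must have the same length")
    header = [
        f"ncols {ncols}",
        f"nrows {len(rows)}",
        f"xllcorner {xll}",
        f"yllcorner {yll}",
        f"cellsize {cellsize}",
        f"NODATA_value {nodata}",
    ]
    body = [" ".join(str(v) for v in row) for row in rows]
    return "\n".join(header + body) + "\n"


# Hypothetical 2x2 precipitation grid with one missing cell.
grid_text = to_ascii_grid([[1.0, 2.0], [3.0, -9999]], xll=-80.0, yll=35.0, cellsize=0.5)
```

Binary formats such as NetCDF, GeoTIFF, and KMZ would normally be written with a dedicated library rather than by hand.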
Discipline-Specific View
Support :
Disaster reduction
Human Health
Climate
Water Resources
Weather
Ocean Resources
Agriculture & Land-Use
Ecosystems
Whole-System View
Atmospheric
Observations
Land Surface
Observation
Ocean
Observations
Space
Observations
Data Systems
Current systems are program-specific, focused,
individually efficient.
But incompatible, not integrated, isolated from one
another and from wider environmental community
Coordinated,
efficient,
integrated,
interoperable
NOAA/NCDC Climate Services
Data supports NOAA/NCDC Mission
• NCDC will need to function in a wider information landscape
with a NOAA Federated Archive (6 data centers)
– Support distributed data management and services
• Interoperable with DataNet, Earth System Grid, GEO-IDE,
EOSDIS, etc.
– netCDF, LDM, CF conventions, ISO 19115-2
• Move out of the Box and into the Cloud (networked)
– Utilize highly distributed storage and computing (RENCI, Oak Ridge
National Lab)
• Implement supporting technologies to enable interoperability
with Designated Communities (OGC, WMS/WFS)
• Institute rules-based data management to enable true
federation of NOAA Centers of Data – iRODS
NCDC-RENCI Potential Use Cases
• Catastrophic Event Modeling and Observations
• Climate Reanalysis Datasets
– Climate records everywhere, for 30 years
– 1 PetaByte
– Regional and local sub-setting
– Tens of thousands of users
• Multi-sensed Gridded Precipitation Climatology
• Extreme Event Climatology
• Green Energy, physical-social science Integration
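The "regional and local sub-setting" item in the reanalysis use case can be sketched as a bounding-box selection on a regular latitude/longitude grid. This is a minimal sketch under stated assumptions: the grid is held in plain Python lists, and the function name and sample coordinates are hypothetical. A production service would subset on the server side so that users never download the full petabyte-scale collection.

```python
def subset_grid(lats, lons, values, bbox):
    """Return (lats, lons, values) restricted to bbox = (south, west, north, east).

    `values[i][j]` is the grid cell at latitude lats[i] and longitude lons[j].
    """
    south, west, north, east = bbox
    row_idx = [i for i, lat in enumerate(lats) if south <= lat <= north]
    col_idx = [j for j, lon in enumerate(lons) if west <= lon <= east]
    sub = [[values[i][j] for j in col_idx] for i in row_idx]
    return [lats[i] for i in row_idx], [lons[j] for j in col_idx], sub


# Hypothetical coarse grid; each value encodes its own (row, column) position.
lats = [34, 35, 36, 37]
lons = [-84, -82, -80, -78, -76]
values = [[i * 10 + j for j in range(len(lons))] for i in range(len(lats))]

region = subset_grid(lats, lons, values, bbox=(35, -82, 36, -78))
```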
Hydrology Community
High Level View of HIS Service Oriented Architecture
As of October 2009,
1,867,108 sites and
4,336,790,286 data values
were available through the
HIS from federal, state, and
academic data providers.
There were 543,144
“GetValues” data requests
from Feb 2008 to Oct 2009.
http://his.cuahsi.org
CUAHSI HIS
The CUAHSI Hydrologic Information System (HIS) is an Internet-based system
to support the sharing of hydrologic data. It comprises hydrologic
databases and servers connected through web services as well as software for
data publication, discovery and access.
HIS Central
Data Discovery and
Integration platform
HydroServer
Data Publication
platform
Like web
servers
Data Services
Water Data Services
Spatial Data Services
Like HTML
Like search portals
Google, Yahoo, Bing
HydroDesktop
Data Synthesis and
Research platform
Like browsers
HIS Service Oriented Architecture
Publication of Point Observations
• Observations Data Model (ODM)
– ODM Tools
– ODM Data Loader
– ODM Streaming Data Loader
– ODM Controlled Vocabularies
• WaterOneFlow web
services
– Data are transmitted in
WaterML format
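To illustrate what a client does with a time series returned by a web service, here is a minimal sketch that parses a small XML fragment into Python values. The fragment is a simplified, WaterML-like stand-in: its element names, attributes, and the site code are assumptions made for illustration and do not reproduce the actual WaterML schema or any real WaterOneFlow response.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified time-series document (NOT the real WaterML schema).
SAMPLE = """<timeSeries>
  <site code="HYPOTHETICAL-001"/>
  <variable name="Discharge" units="cfs"/>
  <values>
    <value dateTime="2009-10-01T00:00:00">12.5</value>
    <value dateTime="2009-10-01T01:00:00">13.0</value>
  </values>
</timeSeries>"""


def parse_series(xml_text):
    """Extract site, variable, units, and (timestamp, value) pairs."""
    root = ET.fromstring(xml_text)
    variable = root.find("variable")
    points = [(v.get("dateTime"), float(v.text)) for v in root.iter("value")]
    return {
        "site": root.find("site").get("code"),
        "variable": variable.get("name"),
        "units": variable.get("units"),
        "points": points,
    }


series = parse_series(SAMPLE)
```

A real client would also handle XML namespaces, quality flags, and missing values, which this sketch omits.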
Dynamic Controlled Vocabulary Moderation System
ODM Data
Manager
HIS CV
Website
ODM
Tools
XML
Local ODM
Database
Local
Server
ODM Controlled
Vocabulary Moderator
ODM
Controlled
Vocabulary
Web Services
http://his.cuahsi.org/mastercvreg.html
Master ODM
Controlled
Vocabulary
CUAHSI Observations Data Model http://his.cuahsi.org/odmdatabases.html
Horsburgh, J. S., D. G. Tarboton, D. R. Maidment and I. Zaslavsky, (2008), A Relational Model for
Environmental and Water Resources Data, Water Resour. Res., 44: W05406, doi:10.1029/2007WR006392.
Maximize Data Access and Utility
Data and Model Integration Needed
to Support Hydrologic Science
DFC
Observations
Hydrologic Models
Physical Data
Weather and Climate
Models
Socioeconomic Data
CUAHSI HIS
Capabilities
Database
Configuration
Tool
WaterOneFlow
ODM
WaterOneFlow
Services
WaterOneFlow
ODM
HydroServer
Capabilities
Database
Spatial
Services
WaterOneFlow
ODM
ODM Databases and Web
ArcGIS Server Spatial Data
HydroServer Distributed Water Data Services
National Dataset Water Data Services
ODM
ODM
WOF
ODM
WOF
WOF
ODM
ODM
WOF
NWIS
WOF
STORET
WOF
DAYMET
WOF
HydroDesktop
SNOTEL
WOF
HIS Central
Desktop
Analysis
Software
R
MATLAB
Excel
Workflow
Desktop
Data
Repository
Database API
Web Service Calls
WaterML
Search, Download,
and Manage Data
Subscriptions
Visualize and
Summarize (TSA)
Data
Discovery
Calls
Metadata
Services
…
WOF
WOF
Metadata
Harvesting
Metadata
Catalog
Variable
Mapping
Convert Units
Convert Formats
and Export
Ontology
Services
Files
Import
HydroDesktop
Plug-ins
Files
Ontology
HydroServer Implementation in
WATERS Network Information System
National Hydrologic Information Server
San Diego Supercomputer Center
• 11 WATERS Network test bed projects
• 16 ODM instances (some test beds have more than one ODM
instance)
• Data from 1246 sites, of these, 167 sites are operated by WATERS
investigators
Meteorology, Hydrology, Ecological Models
Scientific Research
Historical Re-Analysis
TOPS
TOPS
Disaster Planning
ADAS
WRF
CHPS
RHESSys
RHESSYS
HEC-RAS
Disaster Response
Agricultural Forecasts
Ag Decision Support
ADCIRC
ADCIRC
Public Dissemination
Economic Planning
etc …
Sensor Data Bus
Sensor Cloud
• National Weather Service
• Department of Transportation / FAA
• USGS NWIS, USFS
• Buoys, Stream Gauges, Soil Moisture
• People with mobile devices
• etc …
State
Climate
Office
Enablement
Use Case: National Water Model
Hydrologic scientists have expressed a “grand research
challenge” of building a National Water Model for flood
and drought applications.
Flooding in the Mississippi River Basin, August
1993 observed from satellite imagery
Terrain in the Neuse River Basin, NC
constructed from 390 million LiDAR
measurements
Source: terrain.cs.duke.edu
Achieving this goal will require a system like DFC to
handle the massive data requirements.
Source: nasa.gov
CUAHSI Case Study
• Hydrology Grand Challenge Problem: National Water
Model
– How much water is available in the Nation’s water resources?
– Currently, hydrologic models are implemented at the watershed-scale
(county)
– Hydrologists plan to scale physically-based models to national level
• Provide CI, Policies & Sustainability for Water Model Data
– Gathering, analysis, dissemination and preservation
– Policies for quality control, metadata harvesting, versioning and usage
– Enables the data required for real-time analysis for flood and drought
modeling
– Enables integrating data from “new sources”
– Enables new science, outreach, decision making and disaster recovery
– Integration of Predictive Models, Real-time Data and Historic Data
• Technical Solutions
– Too many systems/solutions, home grown to programs
(CUAHSI)
– Standards (ODM, OGC, Virtual USA, etc.)
– Federal enterprises
– NOAA, CLASS general, heavy system
– Oracle front end to large tape system
• Unique
• Handling large sets with limited skills
• Multidisciplinary: formats are not enough; knowledge
is also needed
• Federal
– Has to work, has to preserve
– Observation systems are getting more complex
– Users are more sophisticated and demanding more
Large
Storage
Systems
Compute and
Servers
DataNet
Data Management,
Data Grid Testbed
Data Management
Firewall Security
Workflow
iRODS
Data Manage
HPCC Compute
Diversity in the Landscape
• Data grids to include generic data management infrastructure
– Data sharing
– Digital libraries, publish and discovery
– Persistent archives for preservation
– Data processing pipelines
– Virtualize data collections
• File systems
• Tape archives
• Cloud storage
• Institutional repositories
• Digital repositories
Diversity in the Landscape
• Policy-based Data Management
– Each center has the same management needs but
implements different policies and procedures
– Implement their own policies but leverage
standard data management
– Interoperate with other repositories through
specific drivers that implement protocol
• Integrated Rule Oriented Data System (iRODS)
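The idea that every center shares the same management operations but applies its own policies can be sketched as a common `ingest` operation driven by center-specific rules. This is an illustration in Python only: it is not iRODS rule syntax, and the class, rule names, and action strings are hypothetical.

```python
from typing import Callable, Dict, List

# A rule inspects an object's metadata and returns the actions to apply.
Rule = Callable[[Dict], List[str]]


def checksum_on_ingest(meta: Dict) -> List[str]:
    return ["checksum"]


def replicate_twice(meta: Dict) -> List[str]:
    return ["replicate:2"]


def retain_climate_forever(meta: Dict) -> List[str]:
    return ["retain:forever"] if meta.get("type") == "climate" else []


class Center:
    """A data center: standard operations, locally chosen policy rules."""

    def __init__(self, name: str, rules: List[Rule]):
        self.name = name
        self.rules = rules

    def ingest(self, meta: Dict) -> List[str]:
        """Run the same ingest operation everywhere; the rules differ per center."""
        actions: List[str] = []
        for rule in self.rules:
            actions.extend(rule(meta))
        return actions


center_a = Center("center-a", [checksum_on_ingest, replicate_twice])
center_b = Center("center-b", [checksum_on_ingest, retain_climate_forever])
record = {"name": "reanalysis-1980.nc", "type": "climate"}
```

The same record ingested at both centers yields different action lists, while the calling code stays identical.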
How to Federate?
Users, services and local storage
• Clients – present information in context
– User level file systems
– Web browsers
– Web services
• Workflow – manage processing steps
• Data grid – access to the repositories
– Uniform name space
– Properties (meta) and access (time stamp, version)
– Policies – retention, disposition, authenticity, QA
• Storage Systems – tapes, file system, cloud
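The "uniform name space" layer can be pictured as a catalog that maps one logical path to physical copies held on different storage systems. The sketch below is a toy model under stated assumptions: the class, the paths, and the storage labels are all hypothetical and do not describe how any particular data grid implements its catalog.

```python
class LogicalNamespace:
    """Map logical paths to (storage system, physical path) replicas."""

    def __init__(self):
        self._catalog = {}

    def register(self, logical_path, storage, physical_path):
        """Record that a copy of `logical_path` lives at `physical_path` on `storage`."""
        self._catalog.setdefault(logical_path, []).append((storage, physical_path))

    def resolve(self, logical_path, prefer=None):
        """Return one replica, favoring the `prefer` storage system if present."""
        replicas = self._catalog.get(logical_path)
        if not replicas:
            raise KeyError(f"no replica registered for {logical_path}")
        for storage, physical_path in replicas:
            if storage == prefer:
                return storage, physical_path
        return replicas[0]


ns = LogicalNamespace()
ns.register("/fed/climate/reanalysis/1980.nc", "tape", "/hpss/a1/0001")
ns.register("/fed/climate/reanalysis/1980.nc", "disk", "/data/cache/1980.nc")
```

Clients only ever see the logical path, so a file can move between tape, disk, and cloud storage without breaking references to it.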
Safe Replication
• Repositories must be replicated
• Data grids are good at this
– Making copies
– Keeping track of copies
– Integrity of copies
– Disposition of copies (rules for retention and
checking)
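Making copies, tracking them, and checking their integrity can be sketched with a checksum recorded at ingest and re-verified later. This is a minimal in-memory sketch: the class and site names are hypothetical, and a real data grid would store replicas on separate systems and apply retention rules that are not modeled here.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest used to detect silent corruption of a copy."""
    return hashlib.sha256(data).hexdigest()


class ReplicatedObject:
    def __init__(self, data: bytes, sites):
        self.expected = checksum(data)                   # recorded once, at ingest
        self.copies = {site: data for site in sites}     # making and tracking copies

    def verify(self):
        """Integrity check: compare every copy against the recorded checksum."""
        return {site: checksum(blob) == self.expected
                for site, blob in self.copies.items()}

    def repair(self):
        """Overwrite damaged copies from an intact one; return the repaired sites."""
        status = self.verify()
        good = next((s for s, ok in status.items() if ok), None)
        if good is None:
            raise RuntimeError("no intact replica left")
        damaged = [s for s, ok in status.items() if not ok]
        for site in damaged:
            self.copies[site] = self.copies[good]
        return damaged


obj = ReplicatedObject(b"climate record", ["site-a", "site-b", "site-c"])
obj.copies["site-b"] = b"corrupted bytes"   # simulate damage to one replica
```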
Policy Rules for Control
• Actions that simplify use of data
– Data sharing: access control, distribution,
organizing
– Publishing: Descriptive metadata, integrity,
replication
– Data preservation: retention, disposition, trust,
ownership
• Data ingestion, storage, and access control
User Workspaces
• Needed for interim data products
• Track operations performed on the data
– Same needs as repositories, only shorter
timeframe
– Individual, organization, operation processing
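Tracking the operations performed on interim data products amounts to keeping a small provenance record per workspace. The sketch below assumes each operation has named inputs and one named output; the class, file names, and the `ttl_days` field (standing in for the "shorter timeframe") are hypothetical.

```python
class Workspace:
    """Short-lived workspace that records how each interim product was made."""

    def __init__(self, owner, ttl_days):
        self.owner = owner
        self.ttl_days = ttl_days   # workspaces have a shorter retention than repositories
        self._steps = {}           # output name -> (operation, input names)

    def record(self, operation, inputs, output):
        self._steps[output] = (operation, list(inputs))

    def lineage(self, product):
        """Return the operations needed to rebuild `product`, earliest first."""
        ordered, seen = [], set()

        def visit(name):
            if name not in self._steps or name in seen:
                return
            seen.add(name)
            operation, inputs = self._steps[name]
            for parent in inputs:
                visit(parent)
            ordered.append(operation)

        visit(product)
        return ordered


ws = Workspace(owner="hypothetical-user", ttl_days=30)
ws.record("subset", ["raw.nc"], "nc_subset.nc")
ws.record("regrid", ["nc_subset.nc"], "regridded.nc")
ws.record("mean", ["regridded.nc"], "mean.csv")
```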
Processing and workspaces
• Processing of petabyte collections and
distributed processing
• Process at local storage if processing is simple
• Move files if processing is complex or
demanding.
• Data management views processing
transparently and facilitates:
– Move files
– Manage processing and workspace
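The choice between processing at the storage site and moving files to a compute resource can be expressed as a simple placement rule. This is only a sketch: the cost model (operations per byte against a fixed budget at the storage site) and all numbers are assumptions chosen for illustration.

```python
def plan_processing(size_bytes, ops_per_byte, storage_ops_budget):
    """Decide where to run a job.

    Simple jobs run where the data already sits; complex or demanding
    jobs trigger a file move to a larger compute resource.
    """
    total_ops = size_bytes * ops_per_byte
    if total_ops <= storage_ops_budget:
        return "process_at_storage"
    return "move_files_to_compute"


# Hypothetical jobs: a light scan of 1 GB and a heavy analysis of 1 TB.
light = plan_processing(10**9, ops_per_byte=1, storage_ops_budget=10**10)
heavy = plan_processing(10**12, ops_per_byte=50, storage_ops_budget=10**10)
```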
Frameworks for distributed processing
• iRODS – Integrated Rule Oriented Data System
– Internal workflows (rules of microservices)
– External workflows (Taverna, Kepler, Pegasus)
– Data management decoupled from workflows and both can be
distributed
• Data interchange with workflow
– Parameter passing (microservice)
– In-memory structures (workflow and microservice)
– In-memory, but distributed
– Shared metadata, retrieved out of catalog
– Shared files
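Three of the interchange mechanisms listed above (parameter passing, an in-memory structure, and metadata retrieved from a catalog) can be shown together in a toy rule that chains two steps. This is plain Python written for illustration, not iRODS microservice code; every function name and the catalog contents are hypothetical.

```python
def ms_load_metadata(params, context, catalog):
    """Read shared metadata from the catalog into the in-memory context."""
    context["units"] = catalog[params["object"]]["units"]
    return params


def ms_to_millimetres(params, context, catalog):
    """Convert values to millimetres using the units found by the previous step."""
    factor = {"in": 25.4, "mm": 1.0}[context["units"]]
    return {**params, "values": [v * factor for v in params["values"]], "units": "mm"}


def run_rule(microservices, params, catalog):
    """Chain steps: parameters are passed along, context is shared in memory."""
    context = {}
    for microservice in microservices:
        params = microservice(params, context, catalog)
    return params, context


catalog = {"precip.csv": {"units": "in"}}   # hypothetical metadata catalog
result, context = run_rule(
    [ms_load_metadata, ms_to_millimetres],
    {"object": "precip.csv", "values": [1.0, 2.0]},
    catalog,
)
```

Shared files and distributed in-memory exchange follow the same pattern but need storage or messaging infrastructure that a single-process sketch cannot show.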