Transcript Slide 1
Large-Scale Data Management Challenges
Federating Climate, Water, and Weather Data
Repository/Workspace Workshop
20-21 September 2010
Kenneth Galluppi
Director, Disaster and Environmental Programs
Renaissance Computing Institute
University of North Carolina at Chapel Hill

Outline
• Environmental Problem
• Use Case
  – Climate and Weather
  – Hydrology
• Data Grid/Workspace Use Cases
• Answer Peter’s Questions

Environmental Science Problems
• Enable cutting-edge, Grand Challenge multidisciplinary science through the federation of data grids of climate, water, and weather data with other geospatially and socially relevant datasets.
  – Understanding of regional impacts of climate change on water availability and societal trends
  – Understanding and prediction of catastrophic weather-driven events under climate change
  – Communicate risk/crisis knowledge to non-specialists

Challenges of Data
• Integration of Large, Multidisciplinary Datasets
  – NCDC and NOAA Centers, SDSC, and others
  – Discover, access, integration, utility [not store/retrieve]
• Linkage of Datasets to Computational Models
  – Inputs/outputs for real-time model forecasting
  – Model-to-observation comparison
  – Climatic models for reanalysis and prediction
• Access to Large Reference Data
  – Climate Reanalysis Datasets, 1 PetaByte
  – NWS DataCube for aviation and emergencies

Collaboration and Datagrids
[Diagram labels: Academic Research; Federal Agencies; National Climatic Data Center; 140 universities; Research Program; Emergency Management]

Data Supports NOAA/NCDC Mission
National Climatic Data Center
NOAA Mission: To understand and predict changes in Earth’s environment and conserve and manage coastal and marine resources to meet our nation’s economic, social, and environmental needs
NOAA Goals:
• Climate: Understand Climate Variability and Change to Enhance Society’s Ability to Plan and Respond
• Weather & Water: Serve Society’s Needs for Weather and Water Information
• Commerce & Transportation: Support the Nation’s Commerce with Information
  for Safe, Efficient, and Environmentally Sound Transportation
• Ecosystems: Protect, Restore, and Manage the Use of Coastal and Ocean Resources through an Ecosystem Approach to Management
• Mission Support: Provide Critical Support for NOAA’s Mission

The National Environmental Data Archive
[Diagram labels: Climate; Analysis; RADAR; Satellite; Other]

The National Environmental Data Archive
Comprehensive Large Array-data Stewardship System (CLASS)
Storage (reanalysis)

NOAA CLASS
• Large structured data
• Proprietary
• Doesn’t interface with HPSS
• Climate support of products and services
• Does well what it does

NOAA’s Data Centers Will Function in a Wider Information Landscape
[Diagram labels: NCDC; NGDC; NSOF; NODC; NSF DataNet; ORNL, ESG; DAPs; Data Mgmt; NEAAT; IPCC; International Sources]

Climate Services using Federated DBs
• NOAA’s Data Centers will need to provide access to petabytes of data that are distributed across multiple NOAA facilities
• Be able to integrate these data with data from other disciplines (environmental, biological, social, etc.) that are distributed across other databases in both the public and private sectors
• Export data to common data formats: Shapefile, Well-Known Text, Arc/Info ASCII GRID, Gridded and Raw NetCDF, GeoTIFF, and KMZ (Google Earth)

Discipline-Specific View
Support: Disaster reduction; Human Health; Climate; Water Resources; Weather; Ocean Resources; Agriculture & Land-Use; Ecosystems
Whole-System View
Atmospheric Observations; Land Surface Observations; Ocean Observations; Space Observations; Data Systems
Current systems are program-specific, focused, individually efficient.
But incompatible, not integrated, isolated from one another and from the wider environmental community
Coordinated, efficient, integrated, interoperable

NOAA/NCDC Climate Services
Data supports NOAA/NCDC Mission
• NCDC will need to function in a wider information landscape with a NOAA Federated Archive (6 data centers)
  – Support distributed data management and services
• Interoperable with DataNet, Earth System Grid, GEO-IDE, EOSDIS, etc.
  – netCDF, LDM, CF conventions, ISO 19115-2
• Move out of the Box and into the Cloud (networked)
  – Utilize highly distributed storage and computing (RENCI, Oak Ridge National Lab)
• Implement supporting technologies to enable interoperability with Designated Communities (OGC, WMS/WFS)
• Institute rules-based data management to enable true federation of NOAA Centers of Data
  – iRODS

NCDC-RENCI Potential Use Cases
• Catastrophic Event Modeling and Observations
• Climate Reanalysis Datasets
  – Climate records everywhere, for 30 years
  – 1 PetaByte
  – Regional and local sub-setting
  – Tens of thousands of users
• Multi-sensed Gridded Precipitation Climatology
• Extreme Event Climatology
• Green Energy, physical-social science integration

Hydrology Community

High Level View of HIS Service Oriented Architecture
As of October 2009, 1,867,108 sites and 4,336,790,286 data values were available through the HIS from federal, state, and academic data providers. There were 543,144 “GetValues” data requests from Feb 2008 to Oct 2009.
http://his.cuahsi.org

CUAHSI HIS
The CUAHSI Hydrologic Information System (HIS) is an internet-based system to support the sharing of hydrologic data. It consists of hydrologic databases and servers connected through web services, as well as software for data publication, discovery, and access.
[Diagram: HIS Service Oriented Architecture]
• HIS Central: Data Discovery and Integration platform (like search portals: Google, Yahoo, Bing)
• HydroServer: Data Publication platform (like web servers)
• Data Services: Water Data Services, Spatial Data Services (like HTML)
• HydroDesktop: Data Synthesis and Research platform (like browsers)

Publication of Point Observations
• Observations Data Model (ODM)
  – ODM Tools
  – ODM Data Loader
  – ODM Streaming Data Loader
  – ODM Controlled Vocabularies
• WaterOneFlow web services
  – Data are transmitted in WaterML format

Dynamic Controlled Vocabulary Moderation System
[Diagram labels: ODM Data Manager; HIS CV Website; ODM Tools; XML; Local ODM Database; Local Server; ODM Controlled Vocabulary Moderator; ODM Controlled Vocabulary Web Services; Master ODM Controlled Vocabulary]
http://his.cuahsi.org/mastercvreg.html

CUAHSI Observations Data Model
http://his.cuahsi.org/odmdatabases.html
Horsburgh, J. S., D. G. Tarboton, D. R. Maidment, and I. Zaslavsky (2008), A Relational Model for Environmental and Water Resources Data, Water Resour. Res., 44: W05406, doi:10.1029/2007WR006392.
Maximize Data Access and Utility
Data and Model Integration Needed to Support Hydrologic Science
[Diagram labels: DFC; Observations; Hydrologic Models; Physical Data; Weather and Climate Models; Socioeconomic Data; CUAHSI HIS]

[Diagram: HydroServer, HIS Central, and HydroDesktop components; labels grouped by the component they appear next to]
• HydroServer: Capabilities Database; Capabilities Database Configuration Tool; WaterOneFlow (WOF) Services; ODM Databases and Web; Spatial Services; ArcGIS Server; Spatial Data
• Distributed Water Data Services: ODM databases, each exposed through WOF
• National Dataset Water Data Services: NWIS, STORET, DAYMET, SNOTEL, each exposed through WOF
• HIS Central: Metadata Services; Metadata Harvesting; Metadata Catalog; Variable Mapping; Ontology Services; Ontology
• HydroDesktop: Desktop Analysis Software (R, MATLAB, Excel); Workflow; Desktop Data Repository Database; API; Search, Download, and Manage Data Subscriptions; Visualize and Summarize (TSA); Convert Units; Convert Formats and Export; Files Import; HydroDesktop Plug-ins; Files
• Connections: Web Service Calls (WaterML); Data Discovery Calls

HydroServer Implementation in WATERS Network Information System
National Hydrologic Information Server, San Diego Supercomputer Center
• 11 WATERS Network test bed projects
• 16 ODM instances (some test beds have more than one ODM instance)
• Data from 1246 sites; of these, 167 sites are operated by WATERS investigators

[Diagram: State Climate Office Enablement]
• Meteorology, Hydrology, Ecological Models: TOPS, ADAS, WRF, CHPS, RHESSys, HEC-RAS, ADCIRC
• Uses: Scientific Research; Historical Re-Analysis; Disaster Planning; Disaster Response; Agricultural Forecasts; Ag Decision Support; Public Dissemination; Economic Planning; etc.
• Sensor Data Bus
• Sensor Cloud
  – National Weather Service
  – Department of Transportation / FAA
  – USGS NWIS, USFS
  – Buoys, Stream Gauges, Soil Moisture
  – People with mobile devices
  – etc.

Use Case: National Water Model
Hydrologic scientists have expressed a “grand research challenge” of building a National Water Model for flood and drought applications.
[Image: Flooding in the Mississippi River Basin, August 1993, observed from satellite imagery. Source: nasa.gov]
[Image: Terrain in the Neuse River Basin, NC, constructed from 390 million LiDAR measurements. Source: terrain.cs.duke.edu]
Achieving this goal will require a system like DFC to handle the massive data requirements.

CUAHSI Case Study
• Hydrology Grand Challenge Problem: National Water Model
  – How much water is available in the Nation’s water resources?
  – Currently, hydrologic models are implemented at the watershed scale (county)
  – Hydrologists plan to scale physically based models to the national level
• Provide CI, Policies & Sustainability for Water Model Data
  – Gathering, analysis, dissemination, and preservation
  – Policies for quality control, metadata harvesting, versioning, and usage
  – Enables the data required for real-time analysis for flood and drought modeling
  – Enables integrating data from “new sources”
  – Enables new science, outreach, decision making, and disaster recovery
  – Integration of Predictive Models, Real-time Data, and Historic Data

• Technical Solutions
  – Too many systems/solutions, home grown to programs (CUAHSI)
  – Standards (ODM, OGC, Virtual USA, etc.)
  – Federal enterprises
  – NOAA, CLASS: general, heavy system
  – Oracle front end to large tape system
• Unique
• Handling large sets with limited skills
• Multidisciplinary: formats are not enough; knowledge is also needed
• Federal
  – Has to work, has to preserve
  – Observation systems are getting more complex
  – Users are more sophisticated and demanding more

[Diagram labels: Large Storage Systems; Compute and Servers; DataNet Data Management, Data Grid Testbed; Data Management; Firewall Security; Workflow; iRODS Data Manage; HPCC Compute]

Diversity in the Landscape
• Data grids to include generic data management infrastructure
  – Data sharing
  – Digital libraries, publish and discovery
  – Persistent archives for preservation
  – Data processing pipelines
  – Virtualize data collections
    • File systems
    • Tape archives
    • Cloud storage
    • Institutional repositories
    • Digital repositories

Diversity in the Landscape
• Policy-based Data Management
  – Each center has the same management needs but implements different policies and procedures
  – Implement their own policies but leverage standard data management
  – Interoperate with other repositories through specific drivers that implement the protocol
• Integrated Rule Oriented Data System (iRODS)

How to Federate?
Users, services and local storage
• Clients: present information in context
  – User-level file systems
  – Web browsers
  – Web services
• Workflow: manage processing steps
• Data grid: access to the repositories
  – Uniform name space
  – Properties (meta) and access (time stamp, version)
  – Policies: retention, disposition, authenticity, QA
• Storage Systems: tapes, file system, cloud

Safe Replication
• Repositories must be replicated
• Data grids are good at this
  – Making copies
  – Keeping track of copies
  – Integrity of copies
  – Disposition of copies (rules for retention and checking)

Policy Rules for Control
• Actions that simplify use of data
  – Data sharing: access control, distribution, organizing
  – Publishing: descriptive metadata, integrity, replication
  – Data preservation: retention, disposition, trust, ownership
• Data ingestion, storage, and access control

User Workspaces
• Needed for interim data products
• Track operations performed on the data
  – Same needs as repositories, only shorter timeframe
  – Individual, organization, operation processing

Processing and workspaces
• Processing of petabyte collections and distributed processing
• Process at local storage if processing is simple
• Move the file if processing is complex or demanding
• Data management views processing transparently and facilitates:
  – Move files
  – Manage processing and workspace

Frameworks for distributed processing
• iRODS: integrated Rule Oriented Data System
  – Internal workflows (rules of microservices)
  – External workflows (Taverna, Kepler, Pegasus)
  – Data management decoupled from workflows and both can be distributed
• Data interchange with workflow
  – Parameter passing (microservice)
  – In-memory structures (workflow and microservice)
  – In-memory, but distributed
  – Shared metadata, retrieved out of catalog
  – Shared files
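
The idea of internal workflows as "rules of microservices", combined with the safe-replication duties listed earlier (making copies, keeping track of them, checking their integrity), can be illustrated with a small sketch. This is plain Python under stated assumptions, not the iRODS rule language or API: the event name `on_ingest`, the `ms_*` functions, the shared context dictionary, and the file and site names are all hypothetical, and local directories stand in for remote storage systems. Passing the context dictionary from step to step loosely mirrors the "parameter passing (microservice)" style of data interchange.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical microservices (not iRODS functions): each one takes the
# shared context dict, adds to it, and returns it for the next step.

def ms_checksum(ctx: dict) -> dict:
    """Record the checksum of the source file at ingest time."""
    ctx["checksum"] = sha256_of(ctx["source"])
    return ctx


def ms_replicate(ctx: dict) -> dict:
    """Copy the source into every storage location and track each copy."""
    ctx["replicas"] = []
    for store in ctx["stores"]:
        store.mkdir(parents=True, exist_ok=True)
        target = store / ctx["source"].name
        shutil.copy2(ctx["source"], target)
        ctx["replicas"].append(target)
    return ctx


def ms_verify(ctx: dict) -> dict:
    """Compare each tracked copy against the checksum recorded at ingest."""
    ctx["verified"] = {
        str(replica): sha256_of(replica) == ctx["checksum"]
        for replica in ctx["replicas"]
    }
    return ctx


# A "rule" here is simply an event name mapped to an ordered chain of
# microservices; this is an assumption made for illustration.
RULES = {"on_ingest": [ms_checksum, ms_replicate, ms_verify]}


def fire(event: str, ctx: dict) -> dict:
    """Run every microservice registered for the event, in order."""
    for microservice in RULES.get(event, []):
        ctx = microservice(ctx)
    return ctx


def demo() -> tuple:
    """Ingest one file, then damage a replica and re-run the integrity check."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "precip_grid.nc"
        source.write_bytes(b"stand-in bytes, not a real NetCDF file")
        ctx = fire("on_ingest", {
            "source": source,
            "stores": [root / "site_a", root / "site_b"],
        })
        before = list(ctx["verified"].values())
        ctx["replicas"][0].write_bytes(b"corrupted")  # simulate silent damage
        after = list(ms_verify(ctx)["verified"].values())
        return before, after


if __name__ == "__main__":
    print(demo())
```

Keeping the rule table separate from the microservices reflects the point that each center can apply its own policies while reusing the same data management steps: a different site would change `RULES`, not the functions. Disposition of copies (retention rules) is left out of this sketch.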