The Earth System Grid (ESG) & The Community Data Portal (CDP) (NCAR’s Data & GriD Efforts) for COMMISSION FOR BASIC SYSTEMS INFORMATION SYSTEMS and SERVICES INTERPROGRAMME TASK.
The Earth System Grid (ESG) & The Community Data Portal (CDP) (NCAR’s Data & GriD Efforts)
for COMMISSION FOR BASIC SYSTEMS INFORMATION SYSTEMS and SERVICES INTERPROGRAMME TASK TEAM ON THE FUTURE WMO INFORMATION SYSTEM
KUALA LUMPUR, 20 - 24 OCTOBER 2003
Courtesy: Don Middleton, NCAR Scientific Computing Division

“Atkins Report”
“A new age has dawned…”
“The Panel’s overarching recommendation is that the National Science Foundation should establish and lead a large-scale, interagency, and internationally coordinated Advanced Cyberinfrastructure Program (ACP) to create, deploy, and apply cyberinfrastructure in ways that radically empower all scientific and engineering research and allied education. We estimate that sustained new NSF funding of $1 billion per year is needed to achieve critical mass and to leverage the coordinated co-investment from other federal agencies, universities, industry, and international sources necessary to empower a revolution. The cost of not acting quickly or at a subcritical level could be high, both in opportunities lost and in increased fragmentation and balkanization of the research.”
Atkins Report, Executive Summary

The Earth System Grid
http://www.earthsystemgrid.org
U.S. DOE SciDAC funded R&D effort - a “Collaboratory Pilot Project”
Build an “Earth System Grid” that enables management, discovery, distributed access, processing, & analysis of distributed terascale climate research data
Build upon Globus Toolkit and DataGrid technologies and deploy (Rubber on the road)
Potential broad application to other areas

ESG Team
ANL: Ian Foster (PI), Veronika Nefedova, (John Bresnahan), (Bill Allcock)
LBNL: Arie Shoshani, Alex Sim
ORNL: David Bernholdt, Kasidit Chanchio, Line Pouchard
LLNL/PCMDI: Bob Drach, Dean Williams (PI)
USC/ISI: Ann Chervenak, Carl Kesselman, (Laura Pearlman)
NCAR: David Brown, Luca Cinquini, Peter Fox, Jose Garcia, Don Middleton (PI), Gary Strand

Baseline Numbers
T42 CCSM (current, 280km)
  – 7.5GB/yr, 100 years -> 0.75TB
T85 CCSM (140km)
  – 29GB/yr, 100 years -> 2.9TB
T170 CCSM (70km)
  – 110GB/yr, 100 years -> 11TB

Capacity-related Improvements
Increased turnaround, model development, ensemble of runs
Increase by a factor of 10, linear data
Current T42 CCSM
  – 7.5GB/yr, 100 years -> 0.75TB * 10 = 7.5TB

Capability-related Improvements
Spatial Resolution: T42 -> T85 -> T170
Increase by factor of ~ 10-20, linear data
Temporal Resolution: Study diurnal cycle, 3 hour data
Increase by factor of ~ 4, linear data
[Figure: CCM3 at T170 (70km)]

Capability-related Improvements
Quality: Improved boundary layer, clouds, convection, ocean physics, land model, river runoff, sea ice
Increase by another factor of 2-3, data flat
Scope: Atmospheric chemistry (sulfates, ozone…), biogeochemistry (carbon cycle, ecosystem dynamics), Middle Atmosphere Model…
Increase by another factor of 10+, linear data

Model Improvement Wishlist
Grand Total: Increase compute by a Factor O(1000-10000)

ESG Scenario
End 2002: 1.2 million files comprising ~75TB of data at NCAR, ORNL, LANL, NERSC, and PCMDI
End 2007: As much as 3 PB (3,000 TB) of data (!)
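The data volumes and growth factors quoted on the preceding slides are simple arithmetic. The following Python sketch reproduces them, assuming decimal units (1 TB = 1000 GB). The function and variable names are illustrative and do not come from the presentation, and treating the open-ended "10+" scope factor as exactly 10 is an assumption.

```python
# Output rates (GB per simulated year) quoted on the "Baseline Numbers" slide.
GB_PER_YEAR = {"T42": 7.5, "T85": 29.0, "T170": 110.0}


def run_volume_tb(gb_per_year, years=100):
    """Total output of one run in TB, assuming 1 TB = 1000 GB."""
    return gb_per_year * years / 1000.0


for resolution, rate in GB_PER_YEAR.items():
    print(f"{resolution}: {run_volume_tb(rate):.2f} TB per 100-year run")

# Capacity case from the slides: ten times the T42 workload, data grows linearly.
t42_capacity_tb = 10 * run_volume_tb(GB_PER_YEAR["T42"])

# ESG scenario: ~75 TB at the end of 2002 versus up to 3,000 TB at the end of 2007.
scenario_growth = 3000 / 75

# Compute multipliers from the improvement slides, as (low, high) ranges.
# Assumption: the open-ended "10+" for scope is taken as 10 at both ends.
COMPUTE_FACTORS = {
    "capacity": (10, 10),
    "spatial resolution": (10, 20),
    "temporal resolution": (4, 4),
    "quality": (2, 3),
    "scope": (10, 10),
}
low = high = 1
for factor_low, factor_high in COMPUTE_FACTORS.values():
    low *= factor_low
    high *= factor_high

print(f"T42 capacity case: {t42_capacity_tb} TB")
print(f"Archive growth 2002 -> 2007: {scenario_growth:.0f}x")
print(f"Combined compute factor: {low} to {high}")
```

Under these assumptions the quoted ranges multiply out to between 8,000 and 24,000, the same order of magnitude as the wishlist total. Whether the slide's total was derived this way is not stated in the presentation.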
Current practice is already broken – the future will be even worse if something isn’t done…

ESG Scenario (cont.)
Data
  – Different formats are converted to netCDF
  – netCDF is not standardized to the CF model
  – Different sites require knowledge of different methods of access
Metadata
  – Most kept in online files separate from data and unsearchable unless one is “in the know”
  – Some kept in people’s brains
Access control
  – Manual
  – Not formalized
Data requests
  – Beginnings of a formal process (e.g., the PCMDI model)
  – Beginnings of web portals
  – Far too much done by hand
  – Logging nearly non-existent

ESG: Challenges
Enabling the simulation and data management team
Enabling the core research community in analyzing and visualizing results
Enabling broad multidisciplinary communities to access simulation results
We need integrated scientific work environments that enable smooth WORKFLOW for knowledge development: computation, collaboration & collaboratories, data management, access, distribution, analysis, and visualization.

ESG: Strategies
Move data a minimal amount, keep it close to computational point of origin when possible
  – Data access protocols, distributed analysis
When we must move data, do it fast and with a minimum amount of human intervention
  – Storage Resource Management, fast networks
Keep track of what we have, particularly what’s on deep storage
  – Metadata and Replica Catalogs
Harness a federation of sites, web portals
  – Globus Toolkit -> The Earth System Grid -> The UltraDataGrid

Storage/Data Management
Tools for reliable staging, transport, and replication
[Diagram labels: Client; HRM; Server; Selection; Control; Monitoring; Tera/Peta-scale Archive]

HRM aka “DataMover”
Running well across DOE/HPSS systems
New component built that abstracts NCAR Mass Storage System
Defining next generation of requirements with climate production group
First “real” usage
“The bottom line is that it now works fine and is over 100 times faster than what I was doing before. As important as two orders of magnitude increase in throughput is, more importantly I can see a path that will essentially reduce my own time spent on file transfers to zero in the development of the climate model database” – Mike Wehner, LBNL

OPeNDAP
An Open Source Project for a Network Data Access Protocol (originally DODS, the Distributed Oceanographic Data System)

Distributed Data Access Services
OPeNDAP-g
  – Transparency
  – Performance
  – Security
  – Authorization
  – (Processing)
[Diagram labels: Typical Application; Distributed Application; netCDF lib; OPeNDAP Client; ESG client; Data (local); OPeNDAP via http; OPeNDAP via Grid; ESG + DODS; OPeNDAP Server; ESG Server; Data (remote); Big Data (multiple remotes)]

ESG: NcML Core Schema
For XML encoding of metadata (and data) of any generic netCDF file
Objects: netCDF, dimension, variable, attribute
Beta version reference implementation as Java Library (http://www.scd.ucar.edu/vets/luca/netcdf/extract_metadata.htm)
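As a rough illustration of the four object types in the core schema, the following Python sketch encodes the header of a small, made-up netCDF file as XML. The element and attribute names (`netcdf`, `dimension`, `variable`, `attribute`, `name`, `length`, `type`, `shape`, `value`, `location`) are assumptions modeled on the object names listed above, not the actual NcML schema. The sample file, dimensions, and variable are hypothetical, and the sketch is unrelated to the Java reference implementation.

```python
import xml.etree.ElementTree as ET


def describe_netcdf(location, dimensions, variables, global_attrs):
    """Build an NcML-style XML tree from plain Python metadata (illustrative only)."""
    root = ET.Element("netcdf", {"location": location})
    # One <dimension> element per named dimension.
    for name, length in dimensions.items():
        ET.SubElement(root, "dimension", {"name": name, "length": str(length)})
    # Global attributes hang directly off the root element.
    for name, value in global_attrs.items():
        ET.SubElement(root, "attribute", {"name": name, "value": str(value)})
    # Each variable carries its type, its shape, and its own attributes.
    for var in variables:
        node = ET.SubElement(root, "variable", {
            "name": var["name"],
            "type": var["type"],
            "shape": " ".join(var["shape"]),
        })
        for attr_name, attr_value in var.get("attributes", {}).items():
            ET.SubElement(node, "attribute",
                          {"name": attr_name, "value": str(attr_value)})
    return root


# Hypothetical metadata for a single file.
doc = describe_netcdf(
    "example.nc",
    dimensions={"time": 12, "lat": 64, "lon": 128},
    variables=[{
        "name": "TS",
        "type": "float",
        "shape": ["time", "lat", "lon"],
        "attributes": {"units": "K", "long_name": "surface temperature"},
    }],
    global_attrs={"Conventions": "CF-1.0"},
)
ncml_text = ET.tostring(doc, encoding="unicode")
print(ncml_text)
```

Keeping the header in XML like this lets it be stored, searched, and transformed without opening the data file, which is the use the metadata slides describe.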
[Diagram labels: nc:netCDFType; nc:dimension; nc:VariableType; nc:attribute; netCDF; nc:variable; nc:values]

[Class diagram. Legend: AbstractClass, Class, inheritance (isA), association]
Object: [1] id
Person: [0,1] firstName, [0,1] lastName, [0,1] contact
Institution: [0,1] name, [0,1] type, [0,1] contact
Project: [0,n] topic type=, [0,1] funding
Activity: [0,1] name, [0,1] description, [0,1] rights, [0,n] date type=, [0,n] note, [0,n] participant role=, [0,n] reference uri=
Service: [0,1] name, [0,1] description
Simulation: [0,n] simulationInput type=, [0,n] simulationHardware
Dataset: [0,1] type, [0,1] conventions, [0,n] date type=, [0,n] format type= uri=, [0,1] timeCoverage, [0,1] spaceCoverage
Other classes: Campaign, Investigation, Ensemble, Observation, Experiment, Analysis
Relations: isA, isPartOf, hasParent, hasChild, hasSibling, worksFor, participant role=, generatedBy, serviceId

ESG Metadata Progress
Co-developed NcML with Unidata
  – CF conventions in progress, almost done
Developed & evaluated a prototype metadata system
Finalized an initial schema for PCM/CCSM
  – Address interoperability with federal standards and NASA/GCMD via the generation of DIF/FGDC/ISO
  – Address interoperability with digital libraries via the creation of Dublin Core
Testing relational and native XML databases, and OGSA-DAI
Exploratory work for first-generation ontology
Authoring of discovery metadata in progress

ESG Topology
[Diagram labels. Sites: ANL, LBNL, NCAR, ISI, LLNL, ORNL. Components: CAS; gridFTP servers; HRM; HPSS; MSS; disk; disk cache; RLS; GRAM gatekeeper; ESG web portal (Tomcat/Struts); OGSA-DAI; MySQL RDBMS; MyProxy; LAS server. Actions: authenticate, query, submit, execute, visualize, cross-update]

Collaborations & Relationships
CCSM Data Management Group
The Globus Project
Other SciDAC Projects: Climate, Security & Policy for Group Collaboration, Scientific Data Management ISIC, & High-performance DataGrid Toolkit
OPeNDAP/DODS (multi-agency)
NSF National Science Digital Libraries Program (UCAR & Unidata THREDDS Project)
U.K. e-Science and British Atmospheric Data Centre
NOAA NOMADS and CEOS-grid
Earth Science Portal group (multi-agency, intnl.)

Immediate Directions
Broaden usage of DataMover and refine
Continue building metadata catalogs
Revisit overall security model and consider simplified approaches
Redesign and implement user interface
Alpha version of OPeNDAP-g
  – Test and evaluate with client applications
Develop automation for data publishing (GT3)
Deploy for IPCC runs

The Community Data Portal (CDP)
“The dataportal has changed my life…” Ben Kirtman, COLA
Provide a common portal to NCAR, UCAR, and university data
Provide a sustainable cyberinfrastructure that dramatically lowers the cost of sharing data (there is HUGE interest in this)
Directly couple to simulation systems and DataMonster
Begin capturing rich metadata and catalog our scientific experiments for the world
MSS -> A Petascale Mass Knowledge System
Federate internationally (ESG, THREDDS, U.K. e-Science, NOMADS, PRISM, GEON, etc.)

Foster Revolutionary Change
Mass Storage System (1.5PB)
Petascale Knowledge Repository
Establish a new paradigm for managing and accessing scientific data based on semantic organization.

Community Data Portal
Purpose: Build an infrastructure using different methods for data exploration and delivery
  – Web-based retrieval and interactive analysis for MSS collections
  – Data sharing for multi-institution cooperative studies
  – Browse, select, compare, download data sets, & specify data subsets using graphical, text entry, choice of output format
Components: User interface, Live Access Server (LAS) Middleware, Ferret, NCL, GrADS, File service, local, or DODS
Status: Pilot working (2 years), more middleware testing

Data Access
[Diagram labels: Live Access Client; Live Access Server; Ferret; NCL; Other Engines; DODS; Data Collections]

Massive Data
Simulation & Retrospective
CSM, PCM, DSS, MM5, WRF, MICOM, CMIWG

Example … Data Analysis

Live Access Server + NCL (Grib Data)

Interface and Reanalysis 2 Sea Level Pressure

Community Data Portal architecture
[Diagram labels, by layer]
user interface: UI (x4)
middleware: Struts, GDS, DODS aggregation server, LAS, Tomcat (x4)
core services: catalogs parsing & metadata ingestion; catalogs browsing; data search & discovery; MSS data retrieval; data access (OPeNDAP, FTP, HTTP); data visualization (NCL, Ferret)
hardware: dataportal.ucar.edu, raid disks, MSS

Community Data Portal Metadata Software
[Diagram labels]
THREDDS catalogs reference ESG metadata, DC metadata, NcML metadata, other metadata
THREDDS catalog parser application: parses THREDDS catalogs; stores full XML doc in XML native DB (Xindice); shreds XML doc into tables in relational DB (MySQL)
XML viewer web application: displays XML with schema-specific stylesheets
XML native DB (Xindice): future advanced query (XPath, XQuery)
Relational DB (MySQL): simple query (SQL)
THREDDS catalogs browser web application: uses THREDDS catalogs, links to Search & Discovery web application
Search & Discovery web application results: list of triplets (dataset id, metadata schema, metadata URL)

CDP Data/Catalog Contributors
ACD: MOZART v2.1 standard run (Louisa Emmons)
ATD: Radar almost ready for today!
CGD: CAS satellite data example (Lesley Smith)
CGD: CDAS and VEMAP data (Steve Aulenbach, Nan Rosenbloom, Dave Schimel)
CGD: CCSM 1000 year run (Lawrence Buja)
CGD: PCM 16 top datasets (Gary Strand)
SCD: DSS full data holdings (Bob Dattore, Steve Worley)
SCD: VETS example visualization catalog (Markus Stobbs, Luca Cinquini)
COLA: Jennifer Adams, Jim Kinter, Brian Doty

Next Steps
Recruiting (!)
  – One student for data ingest
  – One software engineer
  – Systems
Expanding storage by 20TB (SCD cosponsor)
Ongoing publication of datasets
Publishing documents on plans, design, how to partner, standard services, and management procedures
Building partnerships, DMWG meeting August

Closing Thoughts
Building a sustainable infrastructure for the long-term
Difficult, expensive, and time-consuming
Requires longer-term projects
Team-building is a critical process
Collaboration technologies really help
Managing all the collaborations is a challenge
But extremely valuable
Good progress, first real usage

Links
Earth System Grid – www.earthsystemgrid.org
Community Data Portal – dataportal.ucar.edu

END

We Will Examine Practically Every Aspect of the Earth System from Space in This Decade
Longer-term Missions - Observation of Key Earth System Interactions
  – Aqua, Terra, Landsat 7, QuikScat, Aura, ICESat, Jason-1
Exploratory - Explore Specific Earth System Processes and Parameters and Demonstrate Technologies
  – Triana, GRACE, VCL, SRTM, CloudSat, PICASSO, EO-1
Courtesy of Tim Killeen, NCAR

Characteristics of Infrastructure
Essential
  – So important that it becomes ubiquitous
Reliable
  – Example: the built environment of the Roman Empire
Expensive
  – Nothing succeeds like excess (e.g. Interstate system)
  – Inherently one-off (often, few economies of scale)
Clear factorization between research and practice
  – Generally deploy what provably works

CDP Interactions & Opportunities
COLA
CGD/VEMAP
ACD, HAO/WACCM
CGD/CCSM, CAM
CGD/CAS
MMM/WRF
UCAR/JOSS
UCAR/Unidata
CGD, SCD, CU/GridBGC
NOAA/NOMADS
GODAE
HAO/TIEGCM, MLSO
ATD/Radar, HIAPER
ACD/Mozart, BVOC, Aqua proposal
BioGeo/CDAS
SCD/DSS
DOE/Earth System Grid
DLESE
GIS Initiative