The Earth System Grid (ESG)
&
The Community Data Portal (CDP)
(NCAR’s Data & Grid Efforts)
for
COMMISSION FOR BASIC SYSTEMS
INFORMATION SYSTEMS and SERVICES
INTERPROGRAMME TASK TEAM ON THE
FUTURE WMO INFORMATION SYSTEM
KUALA LUMPUR, 20 - 24 OCTOBER 2003
Courtesy: Don Middleton
NCAR Scientific Computing Division
NCAR
“Atkins Report”

“A new age has dawned…”
“The Panel’s overarching recommendation is that the
National Science Foundation should establish and lead a
large-scale, interagency, and internationally coordinated
Advanced Cyberinfrastructure Program (ACP) to create,
deploy, and apply cyberinfrastructure in ways that
radically empower all scientific and engineering research
and allied education. We estimate that sustained new NSF
funding of $1 billion per year is needed to achieve critical
mass and to leverage the coordinated co-investment from
other federal agencies, universities, industry, and
international sources necessary to empower a revolution.
The cost of not acting quickly or at a subcritical level
could be high, both in opportunities lost and in increased
fragmentation and balkanization of the research.”
Atkins Report, Executive Summary
NCAR
The Earth System Grid
http://www.earthsystemgrid.org




U.S. DOE SciDAC funded R&D effort - a “Collaboratory
Pilot Project”
Build an “Earth System Grid” that enables
management, discovery, distributed access,
processing, & analysis of distributed terascale climate
research data
Build upon Globus Toolkit and DataGrid technologies
and deploy (Rubber on the road)
Potential broad application to other areas
NCAR
ESG Team

ANL
– Ian Foster (PI)
– Veronika Nefedova
– (John Bresnahan)
– (Bill Allcock)

LBNL
– Arie Shoshani
– Alex Sim

ORNL
– David Bernholdt
– Kasidit Chanchio
– Line Pouchard

LLNL/PCMDI
– Bob Drach
– Dean Williams (PI)

USC/ISI
– Ann Chervenak
– Carl Kesselman
– (Laura Pearlman)

NCAR
– David Brown
– Luca Cinquini
– Peter Fox
– Jose Garcia
– Don Middleton (PI)
– Gary Strand
NCAR
Baseline Numbers

T42 CCSM (current, 280km)
– 7.5GB/yr, 100 years -> .75TB

T85 CCSM (140km)
– 29GB/yr, 100 years -> 2.9TB

T170 CCSM (70km)
– 110GB/yr, 100 years -> 11TB
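
The totals above follow from multiplying the yearly output rate by the run length. A minimal sketch of that arithmetic, assuming 1 TB = 1000 GB (which is what the slide's rounding implies); the function and dictionary names are illustrative only:

```python
# Output rates in GB per simulated year, as quoted on the slide.
RATES_GB_PER_YEAR = {"T42": 7.5, "T85": 29.0, "T170": 110.0}


def run_volume_tb(gb_per_year: float, years: int = 100) -> float:
    """Total output of one run in TB, assuming 1 TB = 1000 GB."""
    return gb_per_year * years / 1000.0


for resolution, rate in RATES_GB_PER_YEAR.items():
    print(f"{resolution}: {run_volume_tb(rate):.2f} TB per 100-year run")
```

The same function covers the capacity case that follows: ten such T42 runs give 10 * run_volume_tb(7.5) TB.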
NCAR
Capacity-related Improvements
Increased turnaround, model development, ensemble of runs
Increase by a factor of 10, linear data

Current T42 CCSM
– 7.5GB/yr, 100 years -> .75TB * 10 = 7.5TB
NCAR
Capability-related Improvements
Spatial Resolution: T42 -> T85 -> T170
Increase by factor of ~ 10-20, linear data
Temporal Resolution: Study diurnal cycle, 3 hour data
Increase by factor of ~ 4, linear data
CCM3 at T170 (70km)
NCAR
Capability-related Improvements
Quality: Improved boundary layer, clouds, convection, ocean
physics, land model, river runoff, sea ice
Increase by another factor of 2-3, data flat
Scope: Atmospheric chemistry (sulfates, ozone…),
biogeochemistry (carbon cycle, ecosystem dynamics),
Middle Atmosphere Model…
Increase by another factor of 10+, linear data
NCAR
Model Improvement Wishlist
Grand Total:
Increase compute by a factor of O(1000-10000)
NCAR
ESG Scenario



End 2002: 1.2 million files comprising ~75TB
of data at NCAR, ORNL, LANL, NERSC, and
PCMDI
End 2007: As much as 3 PB (3,000 TB) of
data (!)
Current practice is already broken – the
future will be even worse if something isn’t
done…
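
To put the two data points in perspective, the implied growth rate can be worked out directly. A rough sketch, assuming the five-year span from end 2002 to end 2007 and steady compound growth (the slide itself only gives the two endpoints, and the 2007 figure is an upper estimate):

```python
def annual_growth_factor(start_tb: float, end_tb: float, years: int) -> float:
    """Constant yearly multiplier that takes start_tb to end_tb in `years` years."""
    return (end_tb / start_tb) ** (1.0 / years)


START_TB = 75.0    # end 2002, from the slide
END_TB = 3000.0    # end 2007 upper estimate, from the slide

overall = END_TB / START_TB
yearly = annual_growth_factor(START_TB, END_TB, 5)
print(f"overall growth: {overall:.0f}x, roughly {yearly:.2f}x per year")
```

Under these assumptions the holdings would have to roughly double every year.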
NCAR
ESG Scenario (cont.)




Data
– Different formats are converted to netCDF
– netCDF is not standardized to the CF model
– Different sites require knowledge of different methods of access
Metadata
– Most kept in online files separate from data and unsearchable unless
one is “in the know”
– Some kept in people’s brains
Access control
– Manual
– Not formalized
Data requests
– Beginnings of a formal process (e.g., the PCMDI model)
– Beginnings of web portals
– Far too much done by hand
– Logging nearly non-existent
NCAR
ESG: Challenges



Enabling the simulation and data
management team
Enabling the core research community in
analyzing and visualizing results
Enabling broad multidisciplinary communities
to access simulation results
We need integrated scientific work environments that enable
smooth WORKFLOW for knowledge development: computation,
collaboration & collaboratories, data management, access,
distribution, analysis, and visualization.
NCAR
ESG: Strategies

Move data a minimal amount, keep it close to
computational point of origin when possible
– Data access protocols, distributed analysis

When we must move data, do it fast and with a
minimum amount of human intervention
– Storage Resource Management, fast networks

Keep track of what we have, particularly what’s on
deep storage
– Metadata and Replica Catalogs

Harness a federation of sites, web portals
– Globus Toolkit -> The Earth System Grid -> The
UltraDataGrid
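
The idea behind "keep track of what we have" can be illustrated with a toy replica catalog: a logical file name maps to every physical copy, and a selection step prefers copies that are not on deep storage. This is only a sketch under assumed names; it is not the Globus RLS API, and the class names, sites, and URLs below are hypothetical:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Replica:
    site: str               # hypothetical site label
    url: str                # hypothetical physical location
    on_deep_storage: bool   # True if the copy must be staged from tape first


class ReplicaCatalog:
    """Toy catalog: logical file name -> list of physical replicas."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[Replica]] = {}

    def register(self, logical_name: str, replica: Replica) -> None:
        self._replicas.setdefault(logical_name, []).append(replica)

    def lookup(self, logical_name: str) -> List[Replica]:
        return list(self._replicas.get(logical_name, []))

    def select(self, logical_name: str, preferred_site: Optional[str] = None) -> Replica:
        """Prefer disk copies over deep storage, then the preferred site."""
        candidates = self.lookup(logical_name)
        if not candidates:
            raise KeyError(logical_name)
        return min(
            candidates,
            key=lambda r: (r.on_deep_storage, r.site != preferred_site),
        )


catalog = ReplicaCatalog()
catalog.register("run1.atm.0001.nc",
                 Replica("site-a", "gsiftp://site-a.example.org/mss/run1.atm.0001.nc", True))
catalog.register("run1.atm.0001.nc",
                 Replica("site-b", "gsiftp://site-b.example.org/disk/run1.atm.0001.nc", False))
best = catalog.select("run1.atm.0001.nc", preferred_site="site-a")
print(best.url)
```

Here the disk copy wins even though the other site is preferred, which mirrors the strategy of avoiding unnecessary staging from deep storage.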
NCAR
Storage/Data Management
[Diagram: a client reaching tera/peta-scale archives through HRM servers]
– Client
– HRM server in front of each Tera/Peta-scale Archive
– Tools for reliable staging, transport, and replication
– Server selection, control, monitoring
NCAR
HRM aka “DataMover”




Running well across DOE/HPSS systems
New component built that abstracts NCAR Mass
Storage System
Defining next generation of requirements with
climate production group
First “real” usage
“The bottom line is that it now works fine and is over 100 times faster than
what I was doing before. As important as two orders of magnitude increase in
throughput is, more importantly I can see a path that will essentially reduce
my own time spent on file transfers to zero in the development of the climate
model database” – Mike Wehner, LBNL
NCAR
OPeNDAP
An Open Source Project for a
Network Data Access Protocol
(originally DODS, the Distributed
Oceanographic Data System)
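
With OPeNDAP/DODS, a client asks a server for part of a dataset by appending a response suffix and a constraint expression to the dataset URL. A small sketch of building such a request URL, assuming the DAP2 hyperslab form `variable[start:stride:stop]`; the host, file name, and variable name are hypothetical:

```python
from typing import Iterable, Tuple


def dap_subset_url(dataset_url: str,
                   variable: str,
                   slices: Iterable[Tuple[int, int, int]],
                   response: str = "ascii") -> str:
    """Build a DAP2-style request URL for one variable.

    Each slice is (start, stride, stop), one per dimension, with an
    inclusive stop index as in DAP2 hyperslabs.
    """
    index = "".join(f"[{start}:{stride}:{stop}]" for start, stride, stop in slices)
    return f"{dataset_url}.{response}?{variable}{index}"


# Hypothetical server and variable: first 12 time steps, first 64 latitudes.
url = dap_subset_url("http://example.org/dods/ccsm_run1.nc", "TS",
                     [(0, 1, 11), (0, 1, 63)])
print(url)
```

Only the requested subset crosses the network, which is the point of keeping data close to where it was produced.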
NCAR
Distributed Data Access Services
[Diagram: three application stacks]
– Typical Application: Application, netCDF lib, Data (local)
– Distributed Application: Application, OPeNDAP Client, OPeNDAP via http, OPeNDAP Server, Data (remote)
– ESG + DODS: Application, ESG client, OPeNDAP via Grid, ESG Server, Big Data (multiple remotes)
OPeNDAP-g
– Transparency
– Performance
– Security
– Authorization
– (Processing)
NCAR
ESG: NcML Core Schema



For XML encoding of metadata (and data) of any generic netCDF file
Objects: netCDF, dimension, variable, attribute
Beta version reference implementation as Java Library
(http://www.scd.ucar.edu/vets/luca/netcdf/extract_metadata.htm)
[Diagram: NcML core schema. Types and elements shown: nc:netCDFType (netCDF), nc:VariableType (nc:variable), nc:dimension, nc:attribute, nc:values]
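
To make the four object kinds concrete, the sketch below builds an XML description of a small netCDF-like structure. It is an illustration only: the element and attribute names follow the objects listed on the slide (netCDF, dimension, variable, attribute) but are not checked against the actual NcML schema, no namespace is used, and the example dimensions and variable are made up:

```python
import xml.etree.ElementTree as ET


def describe_netcdf(dimensions, variables, global_attrs):
    """Return an XML element describing dimensions, variables, and attributes."""
    root = ET.Element("netcdf")
    for name, value in global_attrs.items():
        ET.SubElement(root, "attribute", {"name": name, "value": str(value)})
    for name, length in dimensions.items():
        ET.SubElement(root, "dimension", {"name": name, "length": str(length)})
    for name, spec in variables.items():
        var = ET.SubElement(root, "variable", {
            "name": name,
            "type": spec["type"],
            "shape": " ".join(spec["shape"]),
        })
        for attr_name, attr_value in spec.get("attrs", {}).items():
            ET.SubElement(var, "attribute", {"name": attr_name, "value": str(attr_value)})
    return root


# Made-up example content for illustration.
doc = describe_netcdf(
    dimensions={"time": 12, "lat": 64, "lon": 128},
    variables={"TS": {"type": "float", "shape": ["time", "lat", "lon"],
                      "attrs": {"units": "K"}}},
    global_attrs={"Conventions": "CF-1.0"},
)
print(ET.tostring(doc, encoding="unicode"))
```

Because the description is plain XML, it can be stored, searched, and transformed without opening the binary file itself.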
NCAR
[Class diagram]
LEGEND: AbstractClass, Class; links: inheritance (isA), association

Object
– [1] id

Person
– [0,1] firstName
– [0,1] lastName
– [0,1] contact

Institution
– [0,1] name
– [0,1] type
– [0,1] contact

Activity
– [0,1] name
– [0,1] description
– [0,1] rights
– [0,n] date type=
– [0,n] note
– [0,n] participant role=
– [0,n] reference uri=

Project
– [0,n] topic type=
– [0,1] funding

Service
– [0,1] name
– [0,1] description

Simulation
– [0,n] simulationInput type=
– [0,n] simulationHardware

Dataset
– [0,1] type
– [0,1] conventions
– [0,n] date type=
– [0,n] format type= uri=
– [0,1] timeCoverage
– [0,1] spaceCoverage

Other classes: Campaign, Investigation, Ensemble, Experiment, Observation, Analysis
Link labels: isA, isPartOf, worksFor, participant role=, serviceId, generatedBy, hasParent, hasChild, hasSibling
NCAR
ESG Metadata Progress

Co-developed NcML with Unidata
– CF conventions in progress, almost done


Developed & evaluated a prototype metadata system
Finalized an initial schema for PCM/CCSM
– Address interoperability with federal standards and NASA/GCMD via
the generation of DIF/FGDC/ISO
– Address interoperability with digital libraries via the creation of
Dublin Core



Testing relational and native XML databases, and OGSA-DAI
Exploratory work for first-generation ontology
Authoring of discovery metadata in progress
NCAR
ESG Topology
[Diagram: ESG topology across sites]
Sites: ANL, LBNL, NCAR, ISI, LLNL, ORNL
Components: ESG WEB PORTAL (Tomcat/Struts), CAS, MyProxy, GRAM GATEKEEPER, RLS, HRM, gridFTP SERVER, LAS SERVER, OGSA-DAI, MySQL RDBMS, HPSS, MSS, DISK, cache
Link labels: authenticate, query, submit, execute, visualize, gridFTP, cross-update (between RLS instances)
NCAR
Collaborations & Relationships








CCSM Data Management Group
The Globus Project
Other SciDAC Projects: Climate, Security & Policy for Group
Collaboration, Scientific Data Management ISIC, & High-performance DataGrid Toolkit
OPeNDAP/DODS (multi-agency)
NSF National Science Digital Libraries Program (UCAR &
Unidata THREDDS Project)
U.K. e-Science and British Atmospheric Data Center
NOAA NOMADS and CEOS-grid
Earth Science Portal group (multi-agency, intnl.)
NCAR
Immediate Directions





Broaden usage of DataMover and refine
Continue building metadata catalogs
Revisit overall security model and consider simplified
approaches
Redesign and implement user interface
Alpha version of OPeNDAP-g
– Test and evaluate with client applications

Develop automation for data publishing (GT3)

Deploy for IPCC runs
NCAR
The Community Data Portal (CDP)
“The dataportal has changed my life…”
Ben Kirtman, COLA






Provide a common portal to NCAR, UCAR, and university data
Provide a sustainable cyberinfrastructure that dramatically lowers the cost
of sharing data (there is HUGE interest in this)
Directly couple to simulation systems and DataMonster
Begin capturing rich metadata and catalog our scientific experiments for
the world
MSS -> A Petascale Mass Knowledge System
Federate internationally (ESG, THREDDS, U.K. e-Science, NOMADS, PRISM,
GEON, etc.)
NCAR
Foster Revolutionary Change
Mass Storage System (1.5PB) -> Petascale Knowledge Repository
Establish a new paradigm for managing and accessing
scientific data based on semantic organization.
NCAR
Community Data Portal



Purpose:
– Build an infrastructure using different methods for data
exploration and delivery
– Web-based retrieval and interactive analysis for MSS
collections
– Data sharing for multi-institution cooperative studies
– Browse, select, compare, and download data sets, and specify data
subsets via graphical selection, text entry, and choice of output format
Components:
– User interface: Live Access Server (LAS)
– Middleware: Ferret, NCL, GrADS
– File service: local or DODS
Status:
– Pilot working (2 years), more middleware testing
NCAR
Data Access
[Diagram: data access layers]
– Live Access Client
– Live Access Server
– Engines: Ferret, NCL, other engines
– DODS
– Data collections: massive simulation & retrospective data (CSM, PCM, DSS, MM5, WRF, MICOM, CMIWG)
NCAR
Example … Data Analysis
NCAR
Live Access Server + NCL (GRIB Data)
NCAR
Interface and Reanalysis 2 Sea Level Pressure
NCAR
Community Data Portal architecture
[Diagram: layered architecture of dataportal.ucar.edu]
– user interface: four UIs
– middleware: Struts, GDS, DODS aggregation server, LAS (each on Tomcat)
– core services: catalogs parsing & metadata ingestion; catalogs browsing; data search & discovery; MSS data retrieval; data access (OPeNDAP, FTP, HTTP); data visualization (NCL, Ferret)
– hardware: raid disks, MSS
NCAR
Community Data Portal Metadata Software
[Diagram: metadata flow]
– THREDDS catalogs reference ESG metadata, DC metadata, NcML metadata, and other metadata
– THREDDS catalog parser application: parses the catalogs, stores the full XML doc in an XML native DB (Xindice), and shreds the XML doc into tables in a relational DB (MySQL)
– XML native DB (Xindice): future advanced query (XPath, XQuery)
– Relational DB (MySQL): simple query (SQL)
– XML viewer web application: displays metadata using schema-specific stylesheets
– THREDDS catalogs browser web application
– Search & Discovery web application: results are a list of triplets (dataset id, metadata schema, metadata URL)
– Link labels between the web applications: uses, links to
NCAR
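
The "shred XML into tables, then run simple SQL" path can be sketched in a few lines. This is a simplified illustration, not the CDP software: the XML layout is invented (it is not the real THREDDS catalog schema), the dataset ids and URLs are hypothetical, and Python's built-in sqlite3 stands in for MySQL:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented catalog layout, for illustration only.
CATALOG_XML = """<catalog>
  <dataset id="ccsm.example.run1">
    <metadata schema="NcML" url="http://example.org/meta/run1.ncml.xml"/>
    <metadata schema="DC" url="http://example.org/meta/run1.dc.xml"/>
  </dataset>
  <dataset id="pcm.example.run2">
    <metadata schema="DC" url="http://example.org/meta/run2.dc.xml"/>
  </dataset>
</catalog>"""


def shred(conn: sqlite3.Connection, xml_text: str) -> None:
    """Flatten dataset/metadata elements into one relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata "
        "(dataset_id TEXT, schema_name TEXT, url TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (ds.get("id"), md.get("schema"), md.get("url"))
        for ds in root.iter("dataset")
        for md in ds.findall("metadata")
    ]
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
    conn.commit()


def search(conn: sqlite3.Connection, schema_name: str):
    """Return (dataset id, metadata schema, metadata URL) triplets."""
    cur = conn.execute(
        "SELECT dataset_id, schema_name, url FROM metadata "
        "WHERE schema_name = ? ORDER BY dataset_id",
        (schema_name,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
shred(conn, CATALOG_XML)
triplets = search(conn, "DC")
for triplet in triplets:
    print(triplet)
```

The search result has the same shape as the triplets named on the slide, so a viewer application only needs the metadata URL and the schema name to pick a stylesheet.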
CDP Data/Catalog Contributors









ACD: MOZART v2.1 standard run (Louisa Emmons)
ATD: Radar almost ready for today!
CGD: CAS satellite data example (Lesley Smith)
CGD: CDAS and VEMAP data (Steve Aulenbach, Nan
Rosenbloom, Dave Schimel)
CGD: CCSM 1000 year run (Lawrence Buja)
CGD: PCM 16 top datasets (Gary Strand)
SCD: DSS full data holdings (Bob Dattore, Steve Worley)
SCD: VETS example visualization catalog (Markus Stobbs,
Luca Cinquini)
COLA: Jennifer Adams, Jim Kinter, Brian Doty
NCAR
Next Steps

Recruiting (!)
– One student for data ingest
– One software engineer
– Systems
– Expanding storage by 20TB (SCD cosponsor)

Ongoing publication of datasets
Publishing documents on plans, design, how to
partner, standard services, and management
procedures
Building partnerships, DMWG meeting August
Building partnerships, DMWG meeting August
NCAR
Closing Thoughts

Building a sustainable infrastructure for the
long-term
– Difficult, expensive, and time-consuming
– Requires longer-term projects

Team-building is a critical process
– Collaboration technologies really help

Managing all the collaborations is a challenge
– But extremely valuable

Good progress, first real usage
NCAR
Links

Earth System Grid
– www.earthsystemgrid.org

Community Data Portal
– dataportal.ucar.edu
NCAR
END
NCAR
We Will Examine Practically Every Aspect of the Earth
System from Space in This Decade
Longer-term Missions - Observation of Key Earth System Interactions
– Aqua, Terra, Landsat 7, QuikScat, Aura, ICEsat, Jason-1
Exploratory - Explore Specific Earth System Processes and Parameters and
Demonstrate Technologies
– Triana, GRACE, VCL, SRTM, Cloudsat, PICASSO, EO-1
Courtesy of Tim Killeen, NCAR
NCAR
Characteristics of Infrastructure

Essential
– So important that it becomes ubiquitous

Reliable
– Example: the built environment of the Roman Empire

Expensive
– Nothing succeeds like excess (e.g. Interstate system)
– Inherently one-off (often, few economies of scale)

Clear factorization between research and practice
– Generally deploy what provably works
NCAR
CDP Interactions & Opportunities

COLA
CGD/VEMAP
ACD,HAO/WACCM
CGD/CCSM, CAM
CGD/CAS
MMM/WRF
UCAR/JOSS
UCAR/Unidata
CGD,SCD,CU/GridBGC
NOAA/NOMADS
GODAE
HAO/TIEGCM,MLSO
ATD/Radar, HIAPER
ACD/Mozart, BVOC, Aqua proposal
BioGeo/CDAS
SCD/DSS
DOE/Earth System Grid
DLESE
GIS Initiative
NCAR