Data Grids for HPC:
Geographical Information
System Grids
Marlon Pierce
Geoffrey Fox
Indiana University
December 7, 2004
Internet Seminar
1
Overview from Previous
Lectures
2
Parallel Computing
Parallel processing is built on breaking problems up
into parts and simulating each part on a separate
computer node.
There are several ways of expressing this breakup into
parts in software (a minimal sketch follows the list below):
• Message Passing as in MPI or
• OpenMP model for annotating traditional languages
• Explicitly parallel languages like High Performance Fortran
And several computer architectures designed to
support this breakup
• Distributed Memory with or without custom interconnect
• Shared Memory with or without good cache
• Vectors with usually good memory bandwidth
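As a concrete illustration, here is a minimal message-passing sketch, assuming Python with the mpi4py package; the quadrature problem stands in for a real simulation:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Break the problem (integrating x^2 on [0,1]) into one part per node.
    # For simplicity this sketch assumes size divides n evenly.
    n = 1_000_000
    local_n = n // size
    start = rank * local_n
    h = 1.0 / n

    # Each rank "simulates" only its own slice of the domain.
    local_sum = sum(((i + 0.5) * h) ** 2 for i in range(start, start + local_n)) * h

    # Message passing combines the parts.
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)
    if rank == 0:
        print("integral of x^2 on [0,1] ~", total)

Run with, e.g., mpiexec -n 4 python integrate.py; the same breakup-plus-combine pattern underlies the architectures listed above.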
3
What are Web Services?
Web Services are distributed computer programs that
can be in any language (Fortran .. Java .. Perl .. Python)
The simplest implementations involve XML messages
(SOAP) and programs written in net friendly languages
like Java and Python
Here is a typical e-commerce use:
[Diagram: an e-commerce application composed of Payment, Credit Card, Catalog, Warehouse, and Shipping services, each exposing WSDL interfaces, with Security applied to the interactions]
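To make the message exchange concrete, here is a hedged Python sketch of a SOAP call; the endpoint URL and the getPrice operation are hypothetical, and the point is only that the "call" is an XML document sent over HTTP:

    import urllib.request

    # The SOAP envelope: an XML message describing the remote call.
    envelope = """<?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <getPrice xmlns="http://example.org/catalog">
          <itemId>12345</itemId>
        </getPrice>
      </soap:Body>
    </soap:Envelope>"""

    # POST the envelope to a (hypothetical) catalog service endpoint.
    req = urllib.request.Request(
        "http://example.org/CatalogService",
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "getPrice"})
    print(urllib.request.urlopen(req).read().decode())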
4
What Is the Connection?
Both MPI and Web Services rely upon messaging to
interact.
But they differ in the speed of message transmission:
• MPI is useful at microsecond communication speeds:
clusters, traditional parallel computing.
• Web Services communicate at Internet speeds:
millisecond communication times at best.
This implies that we have (at least) a two-level
programming model.
• Level 1: MPI within science applications on clusters and
HPC.
• Level 2: Programming between science applications.
5
Two-level Programming I
The Web Service (Grid) paradigm implicitly assumes a two-level
programming model.
We make a Service (same as a “distributed object” or “computer
program” running on a remote computer) using conventional
technologies
• C++ Java or Fortran Monte Carlo module perhaps running with MPI on
a parallel machine
• Data streaming from a sensor or Satellite
• Specialized (JDBC) database access
Such services accept and produce data from other services, files
and databases
The Grid is used to coordinate such services, assuming we have
solved the problem of programming each service.
6
Two-level Programming II
The Grid addresses the composition of distributed
services, with runtime interfaces to the Grid as
opposed to UNIX pipes/data streams.
[Diagram: Service1, Service2, Service3, Service4 composed by the Grid]
Familiar from use of UNIX Shell, PERL or Python
scripts to produce real applications from core programs
Such interpretative environments are the single
processor analog of Grid Programming
Some projects like GrADS from Rice University are
looking at integration between service and composition
levels but dominant effort looks at each level separately
7
3 Layer Programming Model
[Diagram: a three-layer stack. Level 1 programming: the applications themselves, written in MPI, Fortran, C++, etc. Level 2 "programming": application semantics (metadata, ontology) via the Semantic Web, on top of basic Web Service infrastructure (Web Service 1, WS 2, WS 3, WS 4). Level 3 programming: workflow, e.g. BPEL.]
The Semantic Web adds another layer between workflow and the
services representing traditional applications.
8
Data and Science Applications
Two- (or three-) level programming applies to all
applications.
Typically we need to bind together HPC and non-HPC
parts.
• How do you provide data to your application?
• How do you share data between applications?
• How do you communicate results to analysis and visualization
programs?
This is particularly important as the size and quality of
observational data are growing rapidly.
Q: How do you easily bind together science apps and
remote data sources?
• A: Web Services (and Grids) provide the unifying
architecture.
9
Grid Libraries
Programming the Grid has many similarities with
conventional languages
• In HPSearch you use similar Scripting languages
Grids are particularly good at supporting user
interfaces, as the browser is itself a particular service.
• Portal technology is an important "gift" of Grids for HPC.
Most promising (and not often exploited) is building
Grid "libraries": collections of services
that can be re-used in several applications.
• A Mastercard service is a typical business Grid library.
• Visualization, sensor processing, and GIS are naturally
distributed components of an HPC application that can be
developed as Grid libraries.
10
Data Grids for HPC
11
Data Deluged Science
In the past, we worried about data in the form of parallel I/O or
MPI-IO, but we didn't consider it as an enabler of new
algorithms and new ways of computing.
Data assimilation was not central to HPCC.
ASC was set up because we didn't want test data!
Now particle physics will get 100 petabytes from CERN.
• Nuclear physics (Jefferson Lab) is in the same situation.
• Uses around 30,000 CPUs simultaneously, 24x7.
Weather forecasting, climate, solid earth (EarthScope, Earth
Systems Grid, GEON).
• We discussed our project SERVOGrid in the October 2004 lecture.
Bioinformatics curated databases (biocomplexity has only 1000s of
data points at present).
Virtual Observatory and SkyServer in Astronomy
Environmental Sensor nets
12
Data Deluge @ Home
In 2003, all of Marion County, IN (including Indianapolis) was surveyed using
Light Detection and Ranging (LiDAR) sensing.
GRW, Inc. flew a Cessna 337 airplane over the entire county to produce
digitized maps.
• 1 point per square meter.
• 495 square miles total.
Can be used to create high resolution contour maps….
But what do you do with all of the data?
• LiDAR data represents a 3 orders of magnitude increase in
data resolution over what is used today in conventional flood
prediction (B. Engles, Purdue).
• Flood modeling codes thus must become HPC codes to handle
the size of newly available data.
13
Example Data Grid:
The Earth System Grid
http://www.earthsystemgrid.org
U.S. DOE SciDAC funded R&D effort
Build an “Earth System Grid” that enables
management, discovery, distributed access,
processing, & analysis of distributed terascale climate
research data
A “Collaboratory Pilot Project”
Build upon ESG-I, Globus Toolkit, DataGrid
technologies, and deploy
Potential broad application to other areas
14
ESG Data Sets
Community Climate Systems Model data
• This is data that is compatible with the National Center for
Atmospheric Research (NCAR) global climate model, CCSM
Couples atmospheric, land surface, ocean, and sea ice
models.
• This is a US government model for climate modeling and
prediction.
• http://www.ccsm.ucar.edu/
Parallel Climate Model data
• Data compatible with extensions to CCSM.
• Uses same atmospheric model but different ocean and sea ice
models.
15
ESG Challenges
By the end of 2003, DOE-sponsored climate change research had
produced 100 TB of scientific data.
• Stored across several DOE sites and NCAR.
This is a consequence of HPC, and it will only escalate as models
simulate global weather patterns at increasingly fine resolution.
Basic problems in data management:
• What is in the data files (metadata)?
• How were data created and by whom (provenance)?
• How can data be stored and moved
between sites efficiently?
• How can data be delivered to the
scientific community?
[Screenshot: the ESG web portal]
16
17
18
Example Data Grid: GEON
Project Goal: Prototype interpretive environments of the future
in Earth Sciences.
Use advanced information technologies to facilitate
collaborative, inter-disciplinary science efforts.
Scientists will be able to discover data, tools, and models via
portals, using advanced, semantics-based search engines and
query tools, in a uniform authentication environment that
provides controlled access to a wide range of resources.
• A prototype “Semantic Grid”
A services-based environment facilitates creation of scientific
workflows that are executed in the distributed environment.
Advanced GIS mapping, 3D, and 4D visualization tools allow
scientists to interact with the data.
www.geongrid.org
19
GEON Grid Application: SYNSEIS
• SYNSEIS is a grid application that provides an
opportunity for seismologists and other earth
science partners to compute and study 3D
seismic records to understand complex subsurface
structures.
• SYNSEIS is built using a service-based
architecture. While it provides users an easy-to-use
GUI to access data, models, and compute
resources, it also provides "connectors" (APIs) for
developers should they choose to use any of its
components in other applications.
20
SYNSEIS Architecture
[Architecture diagram: the SYNSEIS Flash GUI in the GEON Portal calls a SynSeis Engine via SOAP Web services. The engine draws on crustal model databases and the Cornell Map Server, pulls waveform and seismic event catalogs from the IRIS DMC (www.iris.edu), and submits jobs through GASS, GRAM, GridFTP, and GSI to TeraGrid resources at SDSC and NCSA and to the LLNL MCR cluster.]
21
22
GEON SYNSEIS Conclusions
Using Grid technology, the GEON team was able to bring an extremely
complex and cumbersome seismic data analysis procedure to a level at which
anyone can use it efficiently and effectively; SYNSEIS is thus a first step
towards faster discovery.
Democratization of community resources allows not only GEON researchers
but also external community members to access state-of-the-art software
and tools.
Although the tool was developed for GEON applications, it holds
tremendous potential for projects like EarthScope: SYNSEIS can be used by
EarthScope researchers to conduct timely analysis of collected data.
SYNSEIS also has high potential for use in educational environments,
allowing students to experiment with data and make their own
earthquakes.
SYNSEIS has allowed us to practice building distributed data and
computational resources.
23
SERVOGrid Example: GeoFEST
SERVOGrid was discussed in more detail in the October lecture
of this series.
• But worth another mention in this context.
GeoFEST is
• Geophysical Finite Element Simulation Tool
• GeoFEST solves solid mechanics forward models with these
characteristics:
2-D or 3-D irregular domains
1-D, 2-D or 3-D displacement fields
Static elastic or time-evolving viscoelastic problems
Driven by faults, boundary conditions or distributed loads
• GeoFEST runs in a variety of computing environments:
UNIX workstations (including LINUX, Mac OS X, etc.)
Web portal environment
Parallel cluster/supercomputer environment
GeoFEST output can be compared directly with current and
future InSAR satellite data.
24
GeoFEST and Data Grids
GeoFEST works directly with Earth fault data.
Luckily for us, there is a Web Service data source for earth faults
in California
• QuakeTables: accessible for human use through
http://infogroup.usc.edu:8080/public.html
http://complexity.ucs.indiana.edu:8282/jetspeed/index.jsp
• USC, UC-Irvine, and IU designed and built this as part of the SERVO
project.
But GeoFEST needs programmatic access to the fault data
• Users design layer and fault geometry problems and create finite element
meshes through the Web portal interface.
Like GEON, we use portlets.
Portlets are a standard way to make Java-based (and other) portals
out of reusable components.
• Must then pass this information to GeoFEST as an input file.
• GeoFEST may run on a host remote from the data.
25
[Architecture diagram: a Browser Interface connects over HTTP(S) to a User Interface Server, which invokes WSDL/SOAP services: a DB Service backed via JDBC by a database on Host 1; Job Submission/Monitoring and File Services over the operating and queuing systems on Host 2; and a Visualization Service (IDL, GMT) on Host 3.]
26
27
28
[Figure: solid earth observation examples — site-specific irregular scalar measurements (ice sheets in Greenland; volcanoes at Long Valley, CA), constellations for plate boundary-scale vector measurements (PBO), 1 km topography, stress change at Northridge, CA, and earthquakes such as Hector Mine, CA.]
29
[Diagram: Data Deluged Science computing architecture — data from OGSA-DAI Grid services feeds Grid data assimilation, whose distributed filters massage data for simulation on the HPC machine; analysis, control, and visualization services complete the loop.]
30
Data Assimilation
Data assimilation implies one is solving some optimization
problem, which might have a Kalman-filter-like structure:
$$\min_{\text{Theoretical Unknowns}}\ \sum_{i=1}^{N_{\mathrm{obs}}} \frac{\bigl(\mathrm{Data}_i(\mathrm{position},\mathrm{time}) - \mathrm{Simulated\_Value}_i\bigr)^2}{\mathrm{Error}_i^2}$$
Due to the data deluge, one will become more and more dominated
by the data ($N_{\mathrm{obs}}$ much larger than the number of simulation
points).
The natural approach is to form, for each local (position, time)
patch, the "important" data combinations so that the optimization
doesn't waste time on large-error or insensitive data.
Data reduction is done in a naturally distributed fashion, NOT on the
HPC machine, as distributed computing is most cost effective when the
calculations are essentially independent.
• Filter functions must be transmitted from HPC machine
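A minimal sketch of the objective above, assuming Python with numpy (the arrays are toy stand-ins for real observations):

    import numpy as np

    def misfit(data, simulated, error):
        """Sum over i of (Data_i - Simulated_Value_i)^2 / Error_i^2."""
        return np.sum(((data - simulated) / error) ** 2)

    # Toy usage with five observations; the optimizer varies the model
    # unknowns that produce `simulated` so as to minimize this value.
    data = np.array([1.0, 2.1, 2.9, 4.2, 5.0])
    simulated = np.array([1.1, 2.0, 3.0, 4.0, 5.1])
    error = np.full(5, 0.1)
    print(misfit(data, simulated, error))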
31
Distributed Filtering
$N_{\mathrm{obs}}^{\mathrm{local\,patch}} \gg N_{\mathrm{filtered}}^{\mathrm{local\,patch}} \approx \mathrm{Number\_of\_Unknowns}^{\mathrm{local\,patch}}$
In the simplest approach, the filtered data is obtained by linear
transformations on the original data, based on a Singular Value
Decomposition of the least-squares matrix.
[Diagram: geographically distributed sensor patches each hold N_obs observations per local patch; the HPC machine factorizes the least-squares matrix into a product over local patches, sends each patch the filter it needs, and receives back N_filtered values per patch from the distributed machine.]
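A hedged numpy sketch of the SVD-based reduction for a single sensor patch; the matrix shapes and random data are stand-ins:

    import numpy as np

    rng = np.random.default_rng(0)
    n_obs, n_unknowns = 1000, 5               # N_obs >> number of unknowns
    A = rng.normal(size=(n_obs, n_unknowns))  # local least-squares matrix
    y = rng.normal(size=n_obs)                # local observations

    # Thin SVD: A = U S V^T; U spans the directions the unknowns can "see".
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # The HPC machine sends the filter (U) to the patch; the patch returns
    # only U^T y -- n_unknowns numbers instead of n_obs.
    filtered = U.T @ y
    print(filtered.shape)                     # (5,)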
32
Standards For Geographic
Data Services
33
The Story So Far…
HPC applications generate huge amounts of data.
• Constant problem for all HPC centers, including DOD MSRCs.
• Managing scientific information about these applications is just as
important as storage technology.
HPC applications use observational data as input.
• Projects like the ESG, GEON, and SERVO illustrate how HPC
applications need to be coupled to data sources.
• Quantity of observational data is growing rapidly, opening fields for
non-traditional HPC (LiDAR and flood modeling).
Huge amounts of new data potentially drive new HPC
applications (LiDAR -> flood modeling).
Earth sciences are a focus of our examples, but really, many
applications have data sources that are geographically described.
• Weather prediction is an obvious example.
Thus we see the importance of coupling GIS data grid services to
HPC applications for both data access and
visualization/interpretation.
34
What is GIS?
Geographic Information Systems
• ESRI: commercial company with many popular GIS
products.
• Open Geospatial Consortium (formerly OpenGIS
Consortium).
• We will focus on OGC since they define open and
interoperable standards.
What are the characteristics of a GIS?
• Need data models to represent information
• Need services for remotely accessing data.
• Need metadata for determining what is stored in the services.
35
GML: A Data Model For GIS
GML 3.x is an interconnected suite of over 20
XML schemas.
GML is an abstract model for geography.
With GML, you can encode
• Features: abstract representations of map entities.
• Geometry: abstract encodings of how to represent a feature
pictorially.
• Coordinate reference systems
• Topology
• Time, units of measure
• Observation data.
36
Example Use of GML
The SCIGN (Southern
California Integrated GPS
Network) maintains online
catalogs of GPS stations.
Collective data for each site is
made available through
online catalogs.
• Using various text formats.
This is not suitable for
processing, but GML is.
GML can be used to describe
GPS stations using the Feature.xsd
schema, with values encoded
as GPS observations.
www.crisisgrid.org.
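As a hedged illustration, here is a Python sketch turning one line of a text-format GPS catalog into a GML-style fragment; the record, the feature type, and the element names are illustrative, not the actual Feature.xsd schema:

    import xml.etree.ElementTree as ET

    record = "CAND  35.939  -120.433"     # station id, lat, lon (made-up values)
    station, lat, lon = record.split()

    GML = "http://www.opengis.net/gml"
    ET.register_namespace("gml", GML)

    feature = ET.Element("GPSStation")    # illustrative feature type
    ET.SubElement(feature, "stationId").text = station
    point = ET.SubElement(feature, f"{{{GML}}}Point")
    ET.SubElement(point, f"{{{GML}}}coordinates").text = f"{lon},{lat}"

    print(ET.tostring(feature, encoding="unicode"))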
37
Open GIS Services
GML abstract data models can encode data, but you need
services to interact with the remote data.
Some example OGC services include
• Web Feature Service: for retrieving GML-encoded features, like faults,
roads, county boundaries, GPS station locations, ….
• Web Map Service: for creating maps out of Web Features.
• Sensor Grid Services: for working with streaming, time-stamped data.
Problems with OGC services
• Not (yet) Web Service compliant
“Pre” web service, no SOAP or WSDL
Use instead HTTP GET/POST conventions.
• Often re-define general Web Service capabilities as specialized GIS standards
Information services
Notification services in sensor grids
38
Anatomy of WFS (G. Aydin)
WFS provides three major services, as described in the OGC specification
(a sketch of the corresponding HTTP requests follows below):
• GetCapabilities: The client (a WMS server or a user) starts by requesting a
document from the WFS which describes its abilities. When a GetCapabilities request
arrives, the server dynamically creates a capabilities document and returns it.
• This is OGC's formalization of metadata, so important to GEON, ESG, etc.
• DescribeFeatureType: After receiving the capabilities document, the client
can request a more detailed description of any of the features listed in the WFS
capabilities document.
• The WFS returns an XML schema that describes the requested feature.
• Metadata about a specific entry.
• GetFeature: The client can ask the WFS to return a particular portion of any
feature data.
• GetFeature requests contain some property names of the feature and a Filter
element to describe the query.
• The WFS extracts the query and bounding box from the filter and queries the
feature databases.
• The results obtained from the DB query are converted to the feature's GML
format and returned to the client as a FeatureCollection.
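Since these pre-Web Service OGC interfaces use HTTP GET/POST conventions, the three operations can be sketched in Python as plain GET requests; the host and feature name are hypothetical, while the query parameters follow the WFS 1.0 key-value convention:

    import urllib.request
    from urllib.parse import urlencode

    base = "http://example.org/wfs"       # hypothetical WFS endpoint

    def wfs_get(**params):
        query = urlencode({"SERVICE": "WFS", "VERSION": "1.0.0", **params})
        return urllib.request.urlopen(f"{base}?{query}").read()

    caps = wfs_get(REQUEST="GetCapabilities")              # service metadata
    schema = wfs_get(REQUEST="DescribeFeatureType",
                     TYPENAME="fault")                     # feature schema
    gml = wfs_get(REQUEST="GetFeature", TYPENAME="fault",
                  BBOX="-119,34,-118,35")                  # the data itself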
39
Example WFS Capability Entries
Element names and their descriptions:
• Name: A name the service provider assigns to the web feature service instance.
• Title: A human-readable title to briefly identify this server in menus.
• Abstract: A descriptive narrative with more information about the server.
• Keyword: Contains short words to aid catalog searching.
• OnlineResource: Defines the top-level HTTP URL of this service, typically the URL of a "home page" for the service.
• Fees: A text block indicating any fees imposed by the service provider for usage of the service or for data retrieved from the WFS. The keyword NONE is reserved to mean no fees.
• AccessConstraints: A text block describing any access constraints imposed by the service provider on the WFS or data retrieved from that service. The keyword NONE is reserved to indicate no access constraints are imposed.
40
Sample Feature: CA Fault Lines
<gml:featureMember>
  <fault>
    <name>Northridge2</name>
    <segment>Northridge2</segment>
    <author>Wald D. J.</author>
    <gml:lineStringProperty>
      <gml:LineString srsName="null">
        <gml:coordinates>
          -118.72,34.243 -118.591,34.176
        </gml:coordinates>
      </gml:LineString>
    </gml:lineStringProperty>
  </fault>
</gml:featureMember>
After receiving a GetFeature request, the WFS decodes the request, creates a DB query from it, and queries the database.
The WFS then retrieves the features from the database and converts them into GML documents.
Each feature instance is wrapped as a gml:featureMember element.
The WFS returns a wfs:FeatureCollection document which includes all featureMembers returned in the query result.
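A small Python sketch of what a client might do with such a response, assuming the FeatureCollection has been saved to response.xml and uses the unqualified fault/name tags of the sample above:

    import xml.etree.ElementTree as ET

    GML = "{http://www.opengis.net/gml}"
    root = ET.parse("response.xml").getroot()

    for member in root.iter(f"{GML}featureMember"):
        fault = member.find("fault")
        name = fault.findtext("name")
        coords = fault.find(f"{GML}lineStringProperty/"
                            f"{GML}LineString/{GML}coordinates")
        # "lon,lat lon,lat ..." -> list of (lon, lat) pairs
        points = [tuple(map(float, pair.split(",")))
                  for pair in coords.text.split()]
        print(name, points)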
41
[Diagram: a WMS client sends GetFeature requests to a WFS server, which translates them into SQL queries against feature databases — Railroads [a-b], Interstate Highways [12-18], Rivers [a-d], Bridges [1-5] — and returns FeatureCollections.]
• A WFS can serve multiple feature types of data.
• The WFS returns the results of GetFeature requests as GML documents (FeatureCollections).
• Clients may include other services as well as humans.
42
Schematic Interactions Between GIS
Services
[Diagram: clients interact with a WMS, which uses an Information Service (IS) to locate WFS instances: California fault data @complexity, California river data @gf1, and California boundary data @gf1.]
43
Defining IS
The central IS block in the preceding diagram represents
nebulous "information services."
Information services are needed to bind together various GIS
and other services.
• What are their URLs? How do you interact with them (WSDL)? What do
they do (capabilities)?
The OGC defines information services, but they are specialized
to GIS.
• Web Catalogue Service: state appears uncertain.
• Web Registry Service: a common mechanism to classify, register, describe,
search, maintain and access information about OGC Web resources.
But if they adopt Web Service standards, they get Web Service
information system solutions for free.
• IS is a more general problem than just GIS.
44
Universal Description, Discovery
and Integration
UDDI is the standard for building service registries and for
describing their contents.
• UDDI is part of the WS-I core: http://www.ws-i.org/
But no one seems to like it…
Centralized solution
• Single point of failure
Poor discovery model
• No uniform way of querying about services, service interfaces and
classifications.
• Limited query capabilities: search for services restricted to WS name and
its classification
Stale data in registries
• Out-of-date service documents in UDDI registries.
• Need a leasing system
• Registry entries need to be dynamically updated
45
UDDI Has Other Problems
Many Web Services need to maintain the concept of
state between themselves during complicated
interactions.
For example, for better performance, I may wish to
cache maps in a Web Map Server instead of
reconstructing them via calls to a Web Feature Service
every time.
This is basically a glorified HTTP Cookie problem.
We need a way to store this kind of volatile session state
in a lightweight data service.
• UDDI == heavyweight.
So IS must support both registries and contexts.
46
GIS Service Registries
The functional capabilities of a GIS service are defined in its
"capabilities.xml" file.
An information service can gather metadata about the functional
capabilities of a GIS service
• By processing the capabilities file in an automated fashion when a service
is registered
• By having the service provider declare these capabilities when publishing
a service
• The Information Service API introduces a library for XML Schema processing
of different capability files
UDDI with the geospatial focus of GIS services:
• Data layers (features) of a GIS service may have varying geospatial
coverage.
• UDDI registries do not natively support spatial queries.
• We use existing geographic taxonomies, such as the QuadCode taxonomy, to
associate service descriptions with spatial coverage (see the sketch below).
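A hedged sketch of the idea in Python; the exact QuadCode encoding is assumed, but any recursive quadrant scheme gives the prefix property a registry needs for spatial lookups:

    def quad_code(lat, lon, depth=8):
        """Recursively halve the lat/lon box; emit one digit 0-3 per level."""
        code = []
        lat_lo, lat_hi = -90.0, 90.0
        lon_lo, lon_hi = -180.0, 180.0
        for _ in range(depth):
            digit = 0
            lat_mid = (lat_lo + lat_hi) / 2
            lon_mid = (lon_lo + lon_hi) / 2
            if lat >= lat_mid:
                lat_lo, digit = lat_mid, digit + 2
            else:
                lat_hi = lat_mid
            if lon >= lon_mid:
                lon_lo, digit = lon_mid, digit + 1
            else:
                lon_hi = lon_mid
            code.append(str(digit))
        return "".join(code)

    # Services covering nearby regions share a code prefix, so the registry
    # can answer "which WFS covers this point?" with a prefix match.
    print(quad_code(34.05, -118.25))      # Los Angeles area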
47
WS-Context: Session State Service
Repository of Context Information
Allows for
• Sharing Context info
Info related to a particular transaction in multiple Web
Service interactions
• Sharing data
Data in multiple Web service interactions
Simply put, it's a distributed variation of shared
memory.
See http://www.arjuna.com/library/specs/ws_caf_10/WS-CTX.pdf
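A purely conceptual Python sketch of what such a service offers: session-scoped state shared through a context id that travels with the SOAP messages. This models the idea only and is not the WS-Context API:

    class ContextStore:
        """Toy in-memory stand-in for a shared context service."""
        def __init__(self):
            self._contexts = {}           # context id -> {name: value}

        def set(self, context_id, name, value):
            self._contexts.setdefault(context_id, {})[name] = value

        def get(self, context_id, name):
            return self._contexts[context_id][name]

    # Two services participating in the same session share state:
    ctx = ContextStore()
    ctx.set("session-42", "cached_map_url", "http://example.org/maps/42.png")
    # ...later, another service in session-42 retrieves the cached map:
    print(ctx.get("session-42", "cached_map_url"))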
48
An Information Service with both WS-Registry and WS-Context capability
[Architecture diagram: WMS clients connect via HTTP(S) and SOAP through WSDL interfaces to the Information Service, which fronts a WS-Context replica group (WS-Context I and II) and a UDDI Registry replica group (UDDI-Replica I, II, and III), each replica backed by its own database through JDBC.]
49
GIS FTHPIS Implementation Status (M. S. Aktas)
UDDI v.3 implementation with metadata extension [completed]
Processing geographic taxonomies to enable UDDI to
support spatial queries [completed]
WSDL interface to UDDI v.3 [completed]
WSDL interface to WS-Context 1.0
Monitoring scheme
• Leasing [completed]
• Heart-beat
WS-Discovery implementation with metadata extension [completed]
WSDL interface to Information Service
Message dissemination via SOAP Handler Environment
Caching mechanism
Replication mechanism
50
Concluding Remarks
High Performance Computing will be increasingly data
driven.
High volumes of observational data will push many
applications into the realm of HPC.
There must be an overarching architecture to integrate
data sources, HPC applications, visualization
applications, and users.
• Web Service architectures provide this.
• Use them to build Grid libraries.
Large amounts of this data are related to the earth's surface, so
GIS data and service standards need to be integrated
into HPC applications.
51