XML for Scientific Computing Several case studies for XML data in scientific computing

Overview

We will present case studies of the following
systems:
• XSIL: Extensible Scientific Interchange Language
• XDMF: Extensible Data Model and Format
• Discipline Specific XML: ChemicalML
• Gateway Application Descriptors (plus Castor)
XML by itself is just markup, like HTML without
a browser. Each of the above uses a related
set of software to manipulate the XML data.
We present several examples of XML to give
you an overview.
• These are not tutorials, just examples of how others
are using XML.

We conclude with some remarks about
standards for science applications.
Overview of Case Studies

XSIL and XDMF are examples of
representing (meta)data for scientific
computing.
• Concentrate on data structures, data I/O.
• Meaning of data not described.

ChemicalML marks up domain-specific
data.
• Meaningfully describes data content.

Gateway application XML metadata
describes science codes themselves.
• Codes, host computers, queues

All possess a data object model.
• Object oriented data descriptions guide the
markup tag definitions.
XSIL
XML tags for generic
scientific data markup, with
related Java software.
XSIL

Developed in support of several projects
involving CACR at Caltech.
• Example: LIGO, Digital Sky
• Roy Williams, Caltech.




See http://www.cacr.caltech.edu/SDA/xsil/ for
more information and free software.
XSIL was developed for the astronomical and
gravitational wave communities,
but provides general-purpose tags.
It also comes with software for building Java
applications that manipulate and display XSIL
documents.
XSIL Tags

XSIL defines a small number of tags:
• XSIL: base container for the object model.
• Comment:
• Param: an arbitrary name/value pair
• Time: describes time, plus format
• Table: data in columns and rows
• Array: table data with specific size
• URL: and
• Streams: for handling data
We’ll now go over some of these in detail.
The XSIL Tag I



XSIL documents map to a document object
model with associated handling code.
The root tag for XSIL is <XSIL>:
<XSIL Name="Example"
  Type="Examples.MyExample">
  …
</XSIL>
Type points to the Java code that should
process this file.
• It’s some file called MyExample.java in the
package Examples.
The XSIL Tag II


XSIL tags can be nested if different parts of the
XSIL document need to be handled by different
codes.
<XSIL Name="Example" Type="Examples.MyExample">
  …
  <XSIL Name="Subsection" Type="Examples.Subsection">
    …
  </XSIL>
</XSIL>
XSIL tags thus are the base container in a
generic object hierarchy.
• MyExample object “has a” Subsection object
More On Object Containers


Consider an Electromagnetics example:
• A target is represented as a grid for finite
difference integration of Maxwell’s eqns.
• The base input file contains one or more materials.
• Each material has specific EM properties.
If translated to XSIL, could look like this:
<XSIL Name="EMRoot" Type="CEA.Root">
  <!-- Some general parameters -->
  <XSIL Name="EMMaterial" Type="CEA.Material">
    <!-- Some info describing the material. -->
  </XSIL>
</XSIL>
Parameters



Each XSIL tag can contain one or
more parameters.
Params are arbitrary name/value
pairs.
Params optionally have units.
<XSIL …>
  <Param Name="Color">Red</Param>
  <Param Name="Weight" Unit="kg">3.14</Param>
</XSIL>
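To make the name/value/unit structure concrete, here is a minimal sketch that reads Param tags with Python's standard xml.etree.ElementTree module. It is not the XSIL Java software; the helper name read_params and the attribute values on the enclosing XSIL tag are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XSIL fragment modeled on the slide. The Name and Type values
# on the outer tag are placeholders; XML parsers require straight quotes.
XSIL_DOC = """
<XSIL Name="Example" Type="Examples.MyExample">
  <Param Name="Color">Red</Param>
  <Param Name="Weight" Unit="kg">3.14</Param>
</XSIL>
"""

def read_params(xml_text):
    """Return {name: (text_value, unit_or_None)} for each direct Param child."""
    root = ET.fromstring(xml_text)
    params = {}
    for param in root.findall("Param"):
        value = (param.text or "").strip()
        # Unit is optional, so get() returns None when it is absent.
        params[param.get("Name")] = (value, param.get("Unit"))
    return params

params = read_params(XSIL_DOC)
print(params)
```

Values are kept as text here, since a Param carries no type attribute in the example; converting "3.14" to a number is left to the caller.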
Tables


Params associate one value per
name
Tables support multiple values
• A Table row can have any number of
values.


Each table contains column
definitions followed by an arbitrary
number of entries.
Tables get data from Streams
(discussed later).
Example Table
<XSIL…>
  …
  <Table>
    <Column Name="Color" Type="string" />
    <Column Name="Weight" Type="float" Unit="kg" />
    <Column Name="Length" Type="float" Unit="meter" />
    <Stream Type="Local" Delimiter=",">
      "Red",100.2,0.2
      "Green",21.7,1.2
    </Stream>
  </Table>
</XSIL>
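The pairing of column definitions with a delimited local stream can be sketched as follows. This is an illustration in plain Python, not the XSIL library; read_table and the CONVERTERS mapping are hypothetical names, and only the column types used in the example are handled.

```python
import csv
import xml.etree.ElementTree as ET

TABLE_DOC = """
<Table>
  <Column Name="Color" Type="string" />
  <Column Name="Weight" Type="float" Unit="kg" />
  <Column Name="Length" Type="float" Unit="meter" />
  <Stream Type="Local" Delimiter=",">
    "Red",100.2,0.2
    "Green",21.7,1.2
  </Stream>
</Table>
"""

# Converters for the column types used in this example only (an assumption;
# the remaining XSIL types are omitted).
CONVERTERS = {"string": str, "float": float, "int": int}

def read_table(xml_text):
    """Return the table rows as a list of {column_name: typed_value} dicts."""
    table = ET.fromstring(xml_text)
    columns = [(c.get("Name"), CONVERTERS[c.get("Type")])
               for c in table.findall("Column")]
    stream = table.find("Stream")
    delimiter = stream.get("Delimiter", ",")
    # Drop blank lines and indentation before handing the text to csv.
    lines = [ln.strip() for ln in stream.text.splitlines() if ln.strip()]
    rows = []
    for fields in csv.reader(lines, delimiter=delimiter):
        rows.append({name: conv(field)
                     for (name, conv), field in zip(columns, fields)})
    return rows

rows = read_table(TABLE_DOC)
for row in rows:
    print(row)
```

Each row is converted column by column, which is what distinguishes a Table (mixed types) from an Array (one type throughout).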
XSIL Arrays



XSIL arrays are similar to Fortran and C
arrays.
For mixed type data, use Tables.
If all data is the same (integers, floats), use
Arrays.
<Array Type="int">
  <Dim Name="x-dim">2</Dim>
  <Dim Name="y-dim">2</Dim>
  <Stream Type="Local" Delimiter=",">
    137,42
    8,13
  </Stream>
</Array>
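A short sketch of how the Dim tags and the stream combine into a 2-D array. It uses only the Python standard library and is not the XSIL software. The function name read_int_array is hypothetical, and the code assumes the first Dim varies slowest (row-major order), which the slide does not state.

```python
import xml.etree.ElementTree as ET

ARRAY_DOC = """
<Array Type="int">
  <Dim Name="x-dim">2</Dim>
  <Dim Name="y-dim">2</Dim>
  <Stream Type="Local" Delimiter=",">
    137,42
    8,13
  </Stream>
</Array>
"""

def read_int_array(xml_text):
    """Return (dims, nested_list) for a two-dimensional integer Array."""
    array = ET.fromstring(xml_text)
    dims = [int(d.text) for d in array.findall("Dim")]
    if len(dims) != 2:
        raise ValueError("this sketch only handles two dimensions")
    stream = array.find("Stream")
    delimiter = stream.get("Delimiter", ",")
    # Treat line breaks as delimiters too, then flatten to one list of ints.
    tokens = stream.text.replace("\n", delimiter).split(delimiter)
    flat = [int(tok) for tok in tokens if tok.strip()]
    rows, cols = dims
    if len(flat) != rows * cols:
        raise ValueError("stream length does not match the Dim sizes")
    # Assumption: the first Dim varies slowest (row-major layout).
    return dims, [flat[r * cols:(r + 1) * cols] for r in range(rows)]

dims, grid = read_int_array(ARRAY_DOC)
print(dims, grid)
```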
XSIL Streams


XSIL Streams can be used to load data
Data sources can be
• In the file itself (as shown in previous examples).
• From files on disk
• From URLs (http://, ftp://, and file:// supported)

Loading data from disk
<Stream Type="Remote" Encoding="Littleendian">
/home/user1/data/datafile.dat
</Stream>

Loading data from URLs
<Stream Type="Remote">
http://my.server.edu/XSILdata/datafile.dat
</Stream>
Ex: Use XSIL to describe input data
<XSIL Name="InputData" Type="Examples.InDataHandler">
  <XSIL Name="Target 1" Type="Examples.Target">
    <Param Name="Target">Scud</Param>
    <Param Name="dx">0.1</Param>
    <Array>
      <Dim Name="X-Dimension">100</Dim>
      <Dim Name="Y-Dimension">100</Dim>
      <Stream Type="Remote">
        /home/mpierce/data/mydata.dat
      </Stream>
    </Array>
  </XSIL>
  <XSIL Name="Target 2" Type="Examples.Target">
    <!-- Another target -->
  </XSIL>
</XSIL>
Table and Array Types

Table and Array data can be of these types (size in bits):
• boolean (1)
• byte (8)
• short (16)
• int (32)
• long (64)
• float (32)
• double (64)
• floatComplex (64)
• doubleComplex (128)
• string (arbitrary length)
Using XSIL


The previous example just marks up data.
XSIL also comes with Java bindings that
• Read the file and parse it.
• Extract parameter values, units, etc.
• Read in and manipulate tables, arrays

Central ideas:
• Each XSIL tag corresponds to a Java class
• XSIL’s Type points to your custom driver code
that uses the XSIL classes.
XSIL Coding Example

Consider the following small XSIL
example:
<XSIL Type="Examples.MyExample">
  <Param Name="x0">12.0</Param>
  <Param Name="dx">0.1</Param>
</XSIL>
XSIL Java Code Example
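package extensions.Examples;

// Assumes the XSIL classes (XSIL, Param) live in the org.escience.XSIL package.
import org.escience.XSIL.*;

public class MyExample {
    String x0, dx;
    XSIL root;

    // Build the object tree; the XSIL class handles all XML parsing.
    public MyExample(String xsilFileName) {
        root = new XSIL(xsilFileName);
    }

    // Walk the children of the root and pick out the two parameters.
    public void construct() {
        for (int i = 0; i < root.getChildCount(); i++) {
            XSIL x = root.getChild(i);
            if (x instanceof Param) {
                Param p = (Param) x;
                if (p.getName().equals("x0")) x0 = p.getText();
                if (p.getName().equals("dx")) dx = p.getText();
            }
        }
    }
}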
Code Notes


All classes (Param, Table, etc.) extend
the XSIL class.
Pass the path of the XSIL file to the XSIL
class through its constructor.
• XSIL handles all parsing


XSIL class defines getChildCount(),
getChild() methods.
Param class defines getName() and
getText() methods.
XSIL Summary


Defines a small set of general
purpose tags for scientific data.
Data itself is not directly marked up.
• Read in through streams

XSIL software maps Java classes to
XSIL tags.
• Convenient for working with XSIL docs.
• DOM classes are much more
cumbersome to use.
XDMF
A data model geared toward
finite element codes, with
associated software in C++,
Java, and TCL
ICE and XDMF


ICE (Interdisciplinary Computing Environment)
is a comprehensive project at ARL MSRC for
providing a common software platform for DoD
scientific codes.
• Jerry Clarke, lead developer
XDMF (Extensible Data Model and Format)
provides a common data format for several
different codes
• Primary focus: finite element codes for fluid
dynamics and structural mechanics.
• XDMF and related software provide the
backbone for loosely coupling applications
and visualization.
XDMF Design



XDMF divides data into “light” and
“heavy” types.
Light data, or metadata, is formatted
in XML and will be described in more
depth.
Heavy data is in HDF5 and not
presented here.
XDMF Basic Concepts



XDMF basic tags are <DataStructure> and
<DataTransform>
<DataStructure> defines the actual data.
<DataTransform> defines the area of
interest (AOI) in the data.
• AOI defined by coordinates, a function, or a
hyperslab.

<DataTransform> contains one or more
<DataStructures>
• The transform defines how the data structure
will be filtered.
Simple Data Structure
The example below is for 655 XYZ values
in the indicated HDF5 file.
<DataStructure Name="Some XYZ Data"
Type="Float"
Dimensions="655 3">
MyData.h5:/MyXYZdata
</DataStructure>
• Simple character data can also be included
directly in the XML document.

Data Structure for Mesh
Connections and Pressures
<DataStructure
Name="Connections"
Type="Int"
Precision="8"
Dimensions="100 8" >
MyData.h5:/MyConns
</DataStructure>
<DataStructure
Name="Pressure"
Type="Float"
Precision="8"
Dimensions="100">
MyData.h5:/MyPressure
</DataStructure>
Data Structure Attribute Summary
<DataStructure
  Name="Any name"                        Some meaningful name to the owner
  Rank="NumberOfDimensions"              Redundant information
  Dimensions="Kdim Jdim Idim"            The slowest varying dimension is listed first
  Type="Char | Float | Int | Compound"   Default is Float
  Precision="BytesPerElement"            Default is 4
  Format="XML | HDF"                     Default is XML
>
XDMF Array Types

XDMF array entries can have these
types:
• Integer
• Float
• Char

All are 4 bytes by default, can be
increased to 8 bytes.
DataTransform

DataTransform defines a way for the
raw data to be filtered
• Gives a certain Area of Interest in data
set.

Possible transforms:
• Coordinate: Select a particular area
• Function: Define simple algorithm for
selecting area
• Hyperslab: Define start, stride, and
count for each dimension of an array.
Hyperslab Transform Example


The following markup instructs the processing
code to apply a Hyperslab transform to a 4-D
array.
The first data structure defines the hyperslab:
• 0 0 0 0 are the starting points for each dim
• 2 2 2 1 are strides (step sizes) for each dim
• 25 50 75 3 are the number of steps for each dim


The second data structure gives the raw data, a
100x200x300x3 array in the noted HDF5 file.
The resulting region starts at [0,0,0,0], ends at
[50,100,150,2] and includes every other plane of
the untransformed data.
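The start/stride/count selection can be checked with a few lines of Python. This is a sketch under the usual HDF5-style assumption that step k along a dimension selects index start + k * stride. The function name hyperslab_indices is hypothetical and is not part of the XDMF API.

```python
def hyperslab_indices(start, stride, count):
    """Return one list of selected indices per dimension.

    Assumes step k along a dimension selects index start + k * stride.
    """
    if not (len(start) == len(stride) == len(count)):
        raise ValueError("start, stride and count must have the same rank")
    return [list(range(s, s + st * c, st))
            for s, st, c in zip(start, stride, count)]

# Values from the hyperslab described above.
start = [0, 0, 0, 0]
stride = [2, 2, 2, 1]
count = [25, 50, 75, 3]
source_dims = [100, 200, 300, 3]

indices = hyperslab_indices(start, stride, count)
shape = [len(ix) for ix in indices]   # size of the selected region
last = [ix[-1] for ix in indices]     # last selected index per dimension

# Every selected index must fall inside the raw array.
fits = all(l < d for l, d in zip(last, source_dims))
print(shape, last, fits)
```

Under that assumption the selected region has shape 25 x 50 x 75 x 3 and the last index selected along the first dimension is 48, so the end point quoted above is best read as an approximate bound.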
Hyperslab Transform Example
<DataTransform
  Dimensions="25 50 75 3"
  Type="HyperSlab">
  <DataStructure
    Dimensions="3 4"
    Format="XML">
    0 0 0 0
    2 2 2 1
    25 50 75 3
  </DataStructure>
  <DataStructure
    Name="Points"
    Dimensions="100 200 300 3"
    Format="HDF">
    MyData.h5:/XYZ
  </DataStructure>
</DataTransform>
Function Example
<DataTransform
  Type="Function"
  Function="( $0 + .022 ) * ( $1 / 2.0 )"
  Dimensions="2 3">
  <DataStructure
    Dimensions="10 20"
    Format="XML">
    1.1 1.2 1.3 2.1 2.2 2.3
  </DataStructure>
  <DataStructure
    Dimensions="2 3"
    Format="XML">
    2 3 4 4 3 2
  </DataStructure>
</DataTransform>
Explanation of Function Example




The function defines a simple data
transform that creates a new data set
from the existing ones.
In the example, the function takes
elements one at a time from the first ($0)
and second ($1) sets.
First resulting value:
• (1.1+0.022)*(2/2.0)=1.122
Second resulting value:
• (1.2+0.022)*(3/2.0)=1.833
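The element-by-element evaluation can be sketched in plain Python. This is an illustration, not the XDMF engine. The constant is taken as written in the Function attribute of the markup (.022), and the second data set is read as 2 3 4 4 3 2. The name apply_function and its offset parameter are hypothetical; change offset to try a different constant.

```python
def apply_function(first, second, offset=0.022):
    """Evaluate ($0 + offset) * ($1 / 2.0) pairwise over two equal-length sets."""
    if len(first) != len(second):
        raise ValueError("both data sets must have the same number of elements")
    return [(a + offset) * (b / 2.0) for a, b in zip(first, second)]

first = [1.1, 1.2, 1.3, 2.1, 2.2, 2.3]   # $0, the first DataStructure
second = [2, 3, 4, 4, 3, 2]              # $1, the second DataStructure

result = apply_function(first, second)
# Rounded only for display; floating-point results are not exact decimals.
print([round(v, 3) for v in result])
```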
Data Organization

DataStructure and DataTransform constitute
XDMF’s data representation.
• These specify raw data up to the array structure.

XDMF Domain tags are used as arbitrary
containers.
Domains contain Grids, which specify the data
model:
• Grids contain Topologies, Geometries, and Attributes,
as well as DataStructures.
• Topology specifies connectivity between points.
• Geometry specifies points.

Attributes include Scalars, Vectors, and Tensors, and
specify field values.
A Full XDMF Example
<Domain Name="Example #1">
  <Grid Name="My Hex Grid with Pressure">
    <Topology
      Type="Hexahedron"
      Dimensions="100"
      Order="7 6 5 4 3 2 1 0">
      <DataStructure
        Name="Connections"
        Type="Int"
        Precision="8"
        Dimensions="100 8">
        MyData.h5:/MyConns
      </DataStructure>
    </Topology>
    <Geometry Type="XYZ">
      <DataStructure Name="XYZ Data"
        Type="Float"
        Dimensions="655 3">
        MyData.h5:/MyXYZdata
      </DataStructure>
    </Geometry>
    <Attribute Type="Scalar"
      Center="Cell">
      <DataStructure
        Name="Pressure"
        Type="Float"
        Precision="8"
        Dimensions="100">
        MyData.h5:/MyPressure
      </DataStructure>
    </Attribute>
  </Grid>
</Domain>
Review of Example

Recall XDMF is primarily for structured and
unstructured finite element grids.
• Input data includes grid connectivity info, grid
geometry, and pressure values



The Domain contains a Grid
The Grid is defined by Topology,
Geometry, and Attributes.
Topology, Attributes, and Geometry
contain data sources and structure info.
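The containment just reviewed (Domain, Grid, then Topology/Geometry/Attribute, each holding a DataStructure) can be walked with a generic XML parser. The sketch below uses Python's standard xml.etree.ElementTree rather than the XDMF API. The helper name summarize_grid is hypothetical, and splitting the heavy-data reference at the first colon is an assumption about the "file:/path" form seen in the example.

```python
import xml.etree.ElementTree as ET

XDMF_DOC = """
<Domain Name="Example #1">
  <Grid Name="My Hex Grid with Pressure">
    <Topology Type="Hexahedron" Dimensions="100" Order="7 6 5 4 3 2 1 0">
      <DataStructure Name="Connections" Type="Int" Precision="8" Dimensions="100 8">
        MyData.h5:/MyConns
      </DataStructure>
    </Topology>
    <Geometry Type="XYZ">
      <DataStructure Name="XYZ Data" Type="Float" Dimensions="655 3">
        MyData.h5:/MyXYZdata
      </DataStructure>
    </Geometry>
    <Attribute Type="Scalar" Center="Cell">
      <DataStructure Name="Pressure" Type="Float" Precision="8" Dimensions="100">
        MyData.h5:/MyPressure
      </DataStructure>
    </Attribute>
  </Grid>
</Domain>
"""

def summarize_grid(xml_text):
    """List (section, name, dims, hdf5_file, hdf5_path) for each DataStructure."""
    domain = ET.fromstring(xml_text)
    summary = []
    for grid in domain.findall("Grid"):
        for section in grid:  # Topology, Geometry, Attribute
            for ds in section.findall("DataStructure"):
                dims = [int(n) for n in ds.get("Dimensions").split()]
                # Assumed reference form: "<file>:<path inside the file>".
                file_name, _, path = ds.text.strip().partition(":")
                summary.append((section.tag, ds.get("Name"), dims, file_name, path))
    return summary

summary = summarize_grid(XDMF_DOC)
for entry in summary:
    print(entry)
```

Only the light data is touched here; reading the arrays themselves would require an HDF5 library.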
XDMF API


Like XSIL, XDMF treats the XML markup
as a set of instructions to be processed by
actual programs.
XDMF defines an API of document
processing engines.
• Core is in C++
• ICE also provides Java and TCL APIs through
wrappers around core.

See
http://www.arl.hpc.mil/ice/Examples/CodeIntegration/DemoIceRt.cxx
for code example.
XDMF Summary


Provides a few general purpose tags
Again, data is not directly marked
up.
• Stored in HDF5


XDMF handled programmatically with
APIs in C++, Java, Tcl.
More information:
• http://www.arl.hpc.mil/ice/
Comparison of XSIL and XDMF

XSIL
• Larger tag set
• Java API
• Can read data that
is in document, on
disk, from URL
• Questionable
performance and
memory efficiency
for very large data
sets.
• Free and open
source

XDMF
• Uses HDF5 for large
data sets.
• C++, Java, TCL
APIs.
• Defines both data
structures and
transform
instructions.
• Supports arrays,
but not mixed data
types (such as XSIL
Tables).
• Integrated with ICE
Chemical Markup
Language
A domain specific XML
markup language.
CML Introduction




XSIL and XDMF use XML to describe code
input files and give simple processing
instructions.
Tags describe data structure, not content.
We now examine a domain specific
example, the Chemical Markup Language.
Other domain markup languages:
• Mathematics Markup Language (MathML)
• Geography Markup Language (GML)
XML for Chemistry

Goal: provide a common chemical data
format that is an open, universal
standard.
• Data representation is platform independent
• Support structured searches of data banks.
• Provide a common format for software
(particularly visualization).
• Support multidisciplinary data formats
(biology, math) through XML namespaces.
• Provide a data object hierarchy suitable for
object oriented programming.
CML Structure

Chemistry lends itself to object
container structure
• Atoms have protons, neutrons,
electrons
• Molecules have atoms
• Complex molecules and compounds are
composed of molecules, molecular
pieces (benzene rings, for example)

CML defines these as data objects
with property fields
A Simple Example: Glycine
<molecule convention="MDLMol"
  id="glycine" title="GLYCINE">
  <date day="22" month="11"
    year="1995">
  </date>
  <atomArray>
    <atom id="a1">
      <string builtin="elementType">C</string>
      <float builtin="x2">0.6424</float>
      <float builtin="y2">0.4781</float>
    </atom>
    ….
  </atomArray>
  <bondArray>
    <bond id="b1">
      <string builtin="atomRef">a1</string>
      <string builtin="atomRef">a2</string>
      <string builtin="order">1</string>
    </bond>
    ….
  </bondArray>
</molecule>
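Because the tags describe content (atoms, bonds) rather than generic arrays, a program can read the molecule directly into domain objects. The following is a minimal sketch with Python's standard XML parser, not a CML toolkit. The helper read_molecule is a hypothetical name, and atom a2 is a placeholder added so that the bond has two endpoints: its element and coordinates are not real glycine data.

```python
import xml.etree.ElementTree as ET

CML_DOC = """
<molecule convention="MDLMol" id="glycine" title="GLYCINE">
  <atomArray>
    <atom id="a1">
      <string builtin="elementType">C</string>
      <float builtin="x2">0.6424</float>
      <float builtin="y2">0.4781</float>
    </atom>
    <!-- Placeholder atom; values are illustrative, not real glycine data. -->
    <atom id="a2">
      <string builtin="elementType">C</string>
      <float builtin="x2">0.0</float>
      <float builtin="y2">0.0</float>
    </atom>
  </atomArray>
  <bondArray>
    <bond id="b1">
      <string builtin="atomRef">a1</string>
      <string builtin="atomRef">a2</string>
      <string builtin="order">1</string>
    </bond>
  </bondArray>
</molecule>
"""

def read_molecule(xml_text):
    """Return (title, atoms, bonds) from a CML fragment in the style above."""
    mol = ET.fromstring(xml_text)
    atoms = {}
    for atom in mol.find("atomArray").findall("atom"):
        # Each child carries its meaning in the builtin attribute.
        fields = {child.get("builtin"): child.text.strip() for child in atom}
        atoms[atom.get("id")] = (fields["elementType"],
                                 float(fields["x2"]), float(fields["y2"]))
    bonds = []
    for bond in mol.find("bondArray").findall("bond"):
        strings = bond.findall("string")
        refs = [s.text.strip() for s in strings if s.get("builtin") == "atomRef"]
        order = next(s.text.strip() for s in strings if s.get("builtin") == "order")
        bonds.append((refs[0], refs[1], int(order)))
    return mol.get("title"), atoms, bonds

title, atoms, bonds = read_molecule(CML_DOC)
print(title, atoms, bonds)
```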
CML Example Interface
Previous Slide

Browser tool, Jumbo-3.0
• User can display dozens of CML’d
molecules.
• Molecules can be rotated in display.
• Display is rendered in SVG (Adobe
plugin for XML based 2D graphics).
• Molecule displayed is cholesterol. They
also have glycine in database, but not
as exciting to look at.
Gateway Application
Descriptors
Describing scientific applications
themselves with XML and
mapping to Java with Castor.
http://www.gatewayportal.org
Gateway Application Descriptors



Gateway is a computational web
portal for securely submitting and
monitoring jobs, transferring files,
and archiving information.
Gateway describes scientific
applications and host computers with
XML metadata.
This is used to provide general
purpose tools that can be used to
build portals for specific applications.
Application Descriptors



Gateway describes scientific applications and
host machines in XML.
This is used to generate HTML forms needed to
collect information needed to create batch
queuing scripts and job submission.
The general object container scheme is
• Portals contain applications
• Applications contain hosts
• Each also has a set of descriptive
parameters.
Example: ANSYS running on grids
<Application>
  <ApplicationName>ANSYS</ApplicationName>
  <Version>5.0</Version>
  <Parameter Name="IOStyle">
    <Value>StandardIO</Value>
  </Parameter>
  <Parameter Name="NumberOfInFiles">
    <Value>1</Value>
  </Parameter>
  <Host>
    <HostName>grids.ucs.indiana.edu</HostName>
    <HostIP>156.56.103.5</HostIP>
    <RemoteCopy>rcp</RemoteCopy>
    <RemoteExec>rsh</RemoteExec>
    <WorkDir>/tmp</WorkDir>
    <QueueType>CSH</QueueType>
    <QsubPath>/usr/bin/csh</QsubPath>
    <ExecPath>echo</ExecPath>
  </Host>
</Application>
Java Data Object Bindings



As with other examples, the
descriptor does not do anything by
itself.
Must provide language bindings to
make it useful in programs.
We used Castor
(http://castor.exolab.org) to
generate classes for us.
Castor for Data Object Creation





Direct mapping between Application tag and
Java object, for example.
Each object has necessary getter and setter
methods for manipulating data.
After making classes from XML schema
(once), load in XML file to program to create
particular Java data object instances
(unmarshalled)
When program is done, modified data objects
can be marshalled back into XML file format.
We still have to write the Java code for
specific uses, utility classes….
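The unmarshal, modify, marshal cycle can be illustrated without Castor. The sketch below is a Python analogy using hand-written dataclasses and the standard XML parser. Castor generates the Java classes from an XML schema instead, and none of the names here (Host, Application, unmarshal, marshal) come from Castor or Gateway. Only a few fields of the descriptor are modeled.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Host:
    host_name: str
    work_dir: str

@dataclass
class Application:
    name: str
    version: str
    hosts: list = field(default_factory=list)

def unmarshal(xml_text):
    """XML text -> data objects (a subset of the descriptor fields)."""
    root = ET.fromstring(xml_text)
    hosts = [Host(h.findtext("HostName").strip(), h.findtext("WorkDir").strip())
             for h in root.findall("Host")]
    return Application(root.findtext("ApplicationName").strip(),
                       root.findtext("Version").strip(), hosts)

def marshal(app):
    """Data objects -> XML text."""
    root = ET.Element("Application")
    ET.SubElement(root, "ApplicationName").text = app.name
    ET.SubElement(root, "Version").text = app.version
    for host in app.hosts:
        h = ET.SubElement(root, "Host")
        ET.SubElement(h, "HostName").text = host.host_name
        ET.SubElement(h, "WorkDir").text = host.work_dir
    return ET.tostring(root, encoding="unicode")

DESCRIPTOR = """
<Application>
  <ApplicationName>ANSYS</ApplicationName>
  <Version>5.0</Version>
  <Host>
    <HostName>grids.ucs.indiana.edu</HostName>
    <WorkDir>/tmp</WorkDir>
  </Host>
</Application>
"""

app = unmarshal(DESCRIPTOR)
app.hosts[0].work_dir = "/scratch"       # modify the object in memory
round_trip = unmarshal(marshal(app))     # write back to XML, then reload
print(marshal(app))
```

The getters and setters mentioned above correspond to plain attribute access in this sketch; the application-specific logic still has to be written separately.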
Other markup languages
and some comparison
Various shortcomings of
programming and markup
languages
XML Schema

XML Schema defines many built-in
types
• binary, boolean, byte, decimal, double,
float, int, long, short, string
• And many more

Does not define standards for
• Arrays
• Complex (real+imaginary) numbers
SOAP




Known as XML Remote Procedure Call protocol.
• RPC is only one part of SOAP
Also defines encoding rules for data exchange.
SOAP inherits all XML Schema Built-in Types
(see previous slide).
Defines additional compound types
• Struct: arbitrary collection of types (say,
strings and floats) similar to XSIL table entry.
• Array: can contain primitive and compound
types
• A multidimensional array can be built out
of arrays.
HDF5 and XML

Types include
• Integers

2-64 bit, signed or unsigned, big or little
endian
• Floats (32, 64 bit, BE or LE)
• Strings
• Arrays


Arbitrary compound types
See
http://hdf.ncsa.uiuc.edu/HDF5/XML/
Compatibility and Missing Features

No standard XML definitions for arrays
and “compound types” like XSIL tables.
• We have several defs: SOAP, XSIL, XDMF,
XML-HDF5

Lack of built-in support for complex
(real + imaginary) types
• XML, XML-HDF5, XDMF can easily define
complex types, but not in a standard,
agreed-upon fashion.
• Java does not have a built-in complex type,
either. Java also lacks efficient Fortran-style
arrays (arrays are objects).
More Missing Features

Varying support for integers, floats with
different sizes.
• C/C++ does not guarantee consistent bit
size.

Binary data must specify Big
Endian/Little Endian encoding for cross
platform compatibility.
• XML-HDF5, XSIL, XDMF all do this
• XML does not

XSIL does not have signed/unsigned types.
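A small sketch of why byte order and signedness must be stated for binary data, using Python's standard struct module. The value 137 is arbitrary; the point is that the same four bytes decode to different numbers depending on the assumed encoding.

```python
import struct

value = 137

# Encode the same 32-bit integer in both byte orders.
big = struct.pack(">i", value)      # most significant byte first
little = struct.pack("<i", value)   # least significant byte first

# Decode the big-endian bytes with the wrong byte order.
misread_signed = struct.unpack("<i", big)[0]
misread_unsigned = struct.unpack("<I", big)[0]

# Decoding with the matching byte order recovers the value.
correct = struct.unpack(">i", big)[0]

print(big.hex(), little.hex())
print(correct, misread_signed, misread_unsigned)
```

The wrong byte order turns 137 into a large number, and whether that number appears negative or positive depends on whether the reader treats it as signed or unsigned.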
HDF5 Overview
The following slides present a
very simple overview of HDF5
basic concepts.
HDF5 Overview




HDF5 is a general purpose file format
and library for working with scientific
data (vector arrays, structured and
unstructured grids, etc).
HDF5 comes with C, Fortran, and
Java libraries and utilities.
The following slides are a very
simplistic overview of HDF5.
See http://hdf.ncsa.uiuc.edu/HDF5/
HDF Basic Concepts




Two primary object types: groups
and datasets
Groups are a collection of HDF5
objects and descriptive metadata.
Datasets are multidimensional arrays
of data elements, plus metadata.
Both groups and datasets can have
user-defined attributes
HDF Datasets




Datasets are composed of datatypes
and dataspaces.
A dataset’s datatype defines the type
of the array entries (array of
integers, floats, etc.).
Dataspace defines the dimensionality
Finally, we assign some metadata
Example HDF5 File Using Datasets
HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE { H5T_STD_I32BE }
      DATASPACE { SIMPLE ( 4, 6 ) / ( 4, 6 ) }
      DATA {
         0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0
      }
   }
}
}
Obviously this can be done in XML syntax:
http://hdf.ncsa.uiuc.edu/HDF5/XML
Explanation of Previous Slide



This file would be created using HDF5
library calls in a C, Fortran or Java
program.
The file is named “dset.h5”.
The dataset “dset” is part of the root HDF5 group (“/”)
• Groups are used to internally organize datasets and
map to file system locations.


This is a 4x6 array
The elements are 32 bit Big Endian
Integers.
HDF5 Attributes



Attributes are small datasets that can be
attached to an object (like the previous
dataset).
• Attributes use the same format as the
previous dataset example.
• They could be appended to previous dataset
file.
Attributes describe the nature or intended use
of the associated object.
HDF5 library calls can read and write attributes
• Example: you might want to read in the
small attribute description before loading a
large array.
HDF5 Groups

Groups organize other HDF5 objects.
• A simple group might contain one or more
dataset objects such as in the previous
example.


Groups can also contain other groups, so
you can build up hierarchies of data.
Group names are organized like Unix
directories
• / is your root group.
• /foo/dset2 might be a dataset in the group foo.

No access control or security is applied
• Groups are for organization only.