High Performance Data Streaming in a Service Architecture
Jackson State University Internet Seminar
November 18 2004
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
[email protected]
http://www.infomall.org http://www.grid2002.org
Abstract

• We discuss a class of HPC applications characterized by large-scale simulations linked to large data streams coming from sensors, data repositories, and other simulations.
• Such applications will increase in importance to support "data-deluged science".
• We show how Web service and Grid technologies offer significant advantages over traditional approaches from the HPC community.
• We cover Grid workflow (contrasting it with dataflow) and how Web Service (SOAP) protocols can achieve high performance.
Parallel Computing

• Parallel processing is built on breaking problems up into parts and simulating each part on a separate computer node.
• There are several ways of expressing this breakup into parts in software:
  • Message passing, as in MPI
  • The OpenMP model for annotating traditional languages
  • Explicitly parallel languages like High Performance Fortran
• And there are several computer architectures designed to support this breakup:
  • Distributed memory, with or without custom interconnect
  • Shared memory, with or without good cache
  • Vectors, usually with good memory bandwidth
The Six Fundamental MPI routines

• MPI_Init(argc, argv) -- initialize
• MPI_Comm_rank(comm, rank) -- find process label (rank) in group
• MPI_Comm_size(comm, size) -- find total number of processes
• MPI_Send(sndbuf, count, datatype, dest, tag, comm) -- send a message
• MPI_Recv(recvbuf, count, datatype, source, tag, comm, status) -- receive a message
• MPI_Finalize() -- finish up
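To make the six routines concrete, here is a minimal two-process sketch using the mpi4py Python bindings (an assumption; the slide gives the C signatures). mpi4py calls MPI_Init automatically on import and MPI_Finalize at interpreter exit.

```python
# Minimal sketch of the six fundamental routines via mpi4py (assumed installed,
# together with an MPI runtime such as MPICH or Open MPI).
from mpi4py import MPI          # MPI_Init runs automatically on import

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # MPI_Comm_rank: this process's label in the group
size = comm.Get_size()          # MPI_Comm_size: total number of processes

if rank == 0:
    # MPI_Send: data, destination rank, and a tag for flexible matching
    comm.send("hello from rank 0", dest=1, tag=42)
elif rank == 1:
    # MPI_Recv: source and tag select which message to receive
    msg = comm.recv(source=0, tag=42)
    print(f"rank 1 of {size} received: {msg}")
# MPI_Finalize runs automatically when the interpreter exits
```

Run with something like `mpirun -np 2 python hello_mpi.py` (the file name is arbitrary).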
Whatever the Software/Parallel Architecture ...

• The software is a set of linked parts:
  • Threads or processes sharing the same memory, or independent programs on different computers
• And the parts must pass information between them in order to synchronize themselves and ensure they really are working on the same problem.
• The same of course is true in any system:
  • Neurons pass electrical signals in the brain
  • Humans use a variety of information-passing schemes to build communities: voice, book, phone
  • Ants and bees use chemical messages
• Systems are built of parts, and in interesting systems the parts communicate with each other; this communication expresses "why it is a system" and not a bunch of independent bits.
A Picture from 20 years ago
Passing Information

• Information passing between parts covers a wide range in size (number of bits electronically) and "urgency".
• Communication Time = Latency + (Information Size)/Bandwidth (illustrated in the sketch below)
• From society we know that we choose multiple mechanisms with different tradeoffs:
  • Planes have high latency and high bandwidth
  • Walking is low latency but low bandwidth
  • Cars are somewhere in between these cases
• We can always think of the information being transferred as a message:
  • Whether an airplane passenger, sound waves, or a posted letter
  • Whether an MPI message, a UNIX pipe between processes, or a method call between threads
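The cost model in the second bullet can be played with directly. The mechanism parameters below are assumed, order-of-magnitude values (the BlueGene/L figures anticipate a later slide), not measurements.

```python
# time = latency + size / bandwidth, for two assumed messaging mechanisms
def comm_time(size_bytes: float, latency_s: float, bandwidth_bps: float) -> float:
    """Delivery time for one message under the latency + size/bandwidth model."""
    return latency_s + size_bytes / bandwidth_bps

mechanisms = {
    "MPI on BlueGene/L": (5e-6, 0.5e9),      # ~5 us latency, ~0.5 GB/s per node
    "SOAP over the Internet": (50e-3, 1e7),  # ~50 ms latency, ~10 MB/s (assumed)
}

for size in (1_000, 1_000_000):              # 1 KB and 1 MB messages
    for name, (latency, bandwidth) in mechanisms.items():
        t_ms = comm_time(size, latency, bandwidth) * 1e3
        print(f"{size:>9} bytes via {name:<22}: {t_ms:10.3f} ms")
```

For small messages latency dominates completely, which is exactly why MPI and SOAP occupy different niches.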
Parallel Computing and Message Passing

• We worked very hard to get a better programming model for parallel computing that removed the need for the user to:
  • Explicitly decompose the problem and derive a parallel algorithm for the decomposed parts
  • Write MPI programs expressing the explicit decomposition
• This effort wasn't so successful, and on distributed memory machines (including BlueGene/L) at least, MPI-style message passing is the execution model even if one uses a higher-level language.
• So for parallelism we are forced to use message passing, and this is efficient but intellectually hard.
The Latest Top 5 in Top500

[Image: table of the latest top 5 machines in the Top500 list]
What about Web Services?

• Web Services are distributed computer programs that can be in any language (Fortran .. Java .. Perl .. Python).
• The simplest implementations involve XML messages (SOAP) and programs written in net-friendly languages like Java and Python.
• Here is a typical e-commerce use:

[Diagram: Payment, Security, Credit Card, Catalog, and Warehouse/Shipping services linked through WSDL interfaces]
Internet Programming Model

• Web Services are designed as the latest distributed-computing programming paradigm, motivated by the Internet and the expectation that enterprise software will be built on the same software base.
• Parallel Computing is centered on DECOMPOSITION.
• Internet Programming is centered on COMPOSITION.
• The components of e-commerce (catalog, shipping, search, payment) are NATURALLY separated (although they are often mistakenly integrated in older implementations).
• These same components are naturally linked by messages.
• MPI is replaced by SOAP, and the COMPOSITION model is called Workflow.
• Parallel Computing and the Internet have the same execution model (processes exchanging messages) but very different REQUIREMENTS.
Requirements for MPI Messaging

[Diagram: execution alternates calculation phases (tcalc) with communication phases (tcomm)]

• MPI and SOAP messaging both send data from a source to a destination:
  • MPI supports multicast (broadcast) communication
  • MPI specifies the destination and a context (in the comm parameter)
  • MPI specifies the data to send
  • MPI has a tag to allow flexibility in processing in the source processor
  • MPI has calls to understand the context (number of processors etc.)
• MPI requires very low latency and high bandwidth so that tcomm/tcalc is at most 10:
  • BlueGene/L has bandwidth between 0.25 and 3 Gigabytes/sec/node and latency of about 5 microseconds
  • Latency dominates unless Message Size/Bandwidth > Latency; e.g., at 0.5 Gigabytes/sec and 5 microseconds, a message must exceed about 2.5 Kilobytes before bandwidth becomes the dominant cost
BlueGene/L MPI I
http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pdf

BlueGene/L MPI II
http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pdf

BlueGene/L MPI III
http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pdf
[Chart annotation: about 500 Megabytes/sec]
Requirements for SOAP Messaging

• Web Services have much the same requirements as MPI, with two differences where MPI is more stringent than SOAP:
  • Latencies are inevitably 1 (local) to 100 milliseconds, which is 200 to 20,000 times that of BlueGene/L:
    1) 0.000001 ms -- CPU does a calculation
    2) 0.001 to 0.01 ms -- MPI latency
    3) 1 to 10 ms -- wake up a thread or process
    4) 10 to 1000 ms -- Internet delay
  • Bandwidths for many business applications are low, as one just needs to send enough information for an ATM and a bank to define transactions.
• SOAP has MUCH greater flexibility in areas like security, fault-tolerance, and "virtualizing addressing", because one can run a lot of software in 100 milliseconds:
  • It typically takes 1-3 milliseconds to gobble up a modest message in Java and "add value".
Ways of Linking Software Modules

• Closely coupled Java/Python ...: Module A linked directly to Module B; 0.001 to 1 millisecond -- METHOD CALL BASED
• Coarse-grain service model: Service A exchanges messages with Service B; 0.1 to 1000 millisecond latency -- MESSAGE BASED
• EVENT BASED with brokered messages: a Publisher posts events and a "Listener" subscribes to events; Service A and Service B are linked through a message queue "in the sky"
MPI and SOAP Integration

• Note that SOAP specifies the message format and, through WSDL, the interfaces; MPI only specifies the interface, so interoperability between different MPIs requires additional work:
  • IMPI http://impi.nist.gov/IMPI/
• Pervasive networks can support high bandwidth (Terabits/sec soon), but the latency issue is not resolvable in a general way.
• One can combine MPI interfaces with SOAP messaging, but I don't think this has been done.
• Just as walking, cars, planes, and phones coexist with different properties, so SOAP and MPI are both good and should be used where appropriate.
NaradaBrokering

• http://www.naradabrokering.org
• We have built a messaging system that is designed to support traditional Web Services but has an architecture that allows it to support the high-performance data transport required for scientific applications:
  • We suggest using this system whenever your application can tolerate 1-10 millisecond latency in linking components
  • Use MPI when you need much lower latency
• Use the SOAP approach when MPI interfaces are required but latency is high:
  • As in linking two parallel applications at remote sites
• Technically it forms an overlay network, supporting in software features often done at the IP level (a toy sketch of the underlying publish/subscribe pattern follows).
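The following is a toy, in-process illustration of the publish/subscribe pattern that NaradaBrokering provides as a distributed overlay. It is not the NaradaBrokering API (which is Java and networked), just the shape of the idea.

```python
# Toy topic-based broker: publishers post events, listeners subscribe to topics.
from collections import defaultdict
from typing import Any, Callable

class Broker:
    """Routes each published event to every callback subscribed to its topic."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, event: Any) -> None:
        for callback in self._subscribers[topic]:
            callback(event)

broker = Broker()
broker.subscribe("sensor/stream", lambda e: print("listener got:", e))
broker.publish("sensor/stream", {"seq": 1, "payload": b"\x00" * 64})
```

In the real system the broker is a network of servers, so publisher and listener can sit behind different firewalls and use different transports.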
[Chart: mean transit delay for message samples in NaradaBrokering over different communication hops (hop-2, hop-3, hop-5, hop-7); transit delay of roughly 0-9 milliseconds versus message payload size of 100-1000 bytes. Testbed: Pentium-3 1 GHz, 256 MB RAM, 100 Mbps LAN, JRE 1.3 on Linux]
[Chart: standard deviation for message samples in NaradaBrokering over different communication hops (hop-2, hop-3, hop-5, hop-7) on internal machines; roughly 0-0.8 milliseconds versus message payload size of 1000-5000 bytes]
Average Video Delays for one broker (divide by N for N load-balanced brokers)

[Chart: latency in milliseconds versus number of receivers, for one session and multiple sessions at 30 frames/sec]
NB-enhanced GridFTP

• Adds reliability and Web Service interfaces to GridFTP
• Preserves parallel TCP performance and offers a choice of transport and firewall penetration
Role of Workflow

[Diagram: Service-1, Service-2, and Service-3 linked by two-way messages]

• Programming SOAP and Web Services (the Grid): workflow describes the linkage between services.
• As the services are distributed, the linkage must be by messages.
• The linkage is two-way and carries both control and data.
• Apply this to multi-disciplinary and multi-scale linkage, multi-program linkage, linking visualization to simulation, GIS to simulations, and visualization filters to each other.
• The Microsoft-IBM specification BPEL is the currently preferred Web Service XML specification of workflow (a toy composition sketch follows).
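A toy sketch of workflow-as-composition under stated assumptions: the three "services" here are plain Python functions standing in for remote services, and the linkage that BPEL would declare in XML is just an ordered list.

```python
# Workflow = declared linkage between services; each link is a message
# carrying both control and data, per the bullets above.
def sensor_service(_msg):                 # source: produces data
    return {"control": "ok", "data": [1.0, 2.0, 3.0]}

def datamining_service(msg):              # transformer
    return {"control": msg["control"], "data": [x * x for x in msg["data"]]}

def visualization_service(msg):           # sink: consumes data
    print("render:", msg["data"])
    return {"control": "done"}

workflow = [sensor_service, datamining_service, visualization_service]

message = None
for service in workflow:                  # the workflow engine's job
    message = service(message)
```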
Example workflow

Here a sensor feeds a data-mining application (we are extending data mining in DoD applications with Grossman from UIC). The data-mining application drives a visualization.
Example Flood Simulation workflow

[Diagram: Data Archives feed Runoff Models, which feed Flow Models; GIS Grid Services link the distributed data and applications via SOAP messages and events]
SERVOGrid Codes, Relationships

[Diagram relating: Elastic Dislocation Inversion, Viscoelastic FEM, Viscoelastic Layered BEM, Elastic Dislocation, Pattern Recognizers, Fault Model BEM]

This linkage is called Workflow in Grid/Web Service parlance.
Two-level Programming I

• The Web Service (Grid) paradigm implicitly assumes a two-level programming model.
• We make a Service (the same as a "distributed object" or "computer program" running on a remote computer) using conventional technologies:
  • A C++, Java, or Fortran Monte Carlo module
  • Data streaming from a sensor or satellite
  • Specialized (JDBC) database access
• Such services accept and produce data from users, files, and databases.
• The Grid is built by coordinating such services, assuming we have solved the problem of programming the service.
Two-level Programming II

• The Grid is discussing the composition of distributed services, with runtime interfaces to the Grid as opposed to UNIX pipes/data streams. [Diagram: Service1, Service2, Service3, Service4 composed]
• This is familiar from the use of UNIX shell, PERL, or Python scripts to produce real applications from core programs.
• Such interpretative environments are the single-processor analog of Grid programming.
• Some projects, like GrADS from Rice University, are looking at integration between the service and composition levels, but the dominant effort looks at each level separately.
3 Layer Programming Model

[Diagram, top to bottom:
Application -- Level 1 programming (MPI, Fortran, C++ etc.)
Application Semantics (metadata, ontology; the Semantic Web) -- Level 2 "programming"
Basic Web Service infrastructure (Web Service 1, WS 2, WS 3, WS 4) -- Workflow, Level 3 programming (BPEL)]

Workflow will be built on top of NaradaBrokering as the messaging layer.
Structure of SOAP

• SOAP defines a very obvious message structure, with a header and a body, just like email.
• The header contains information used by the "Internet operating system":
  • Destination, Source, Routing, Context, Sequence Number ...
• The message body is partly further information used by the operating system and partly information for the application; the latter is not looked at by the "operating system" except to encrypt or compress it, etc.
  • Note that WS-Security supports separate encryption for different parts of a document.
• Much discussion in the field revolves around what is referenced in the header.
• This structure makes it possible to define VERY sophisticated messaging (a small sketch follows).
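A small sketch of the envelope structure in Python. The SOAP 1.1 envelope namespace is real; the header element names (To, From, MessageNumber) are illustrative stand-ins in the spirit of WS-Addressing and WS-ReliableMessaging, not exact spec elements.

```python
# Build a SOAP-style message: a header for the "Internet operating system"
# and a body for the application.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
header = ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")

# Header: destination, source, sequence number -- routing/context information
ET.SubElement(header, "To").text = "http://example.org/datamining"  # assumed endpoint
ET.SubElement(header, "From").text = "http://example.org/sensor"
ET.SubElement(header, "MessageNumber").text = "42"

# Body: application payload, opaque to intermediaries
ET.SubElement(body, "SensorReading").text = "3.14159"

print(ET.tostring(envelope, encoding="unicode"))
```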
Deployment Issues for "System Services"

• "System Services" (handlers/filters) are ones that act before the real application logic of a service.
• They gobble up the part of the SOAP header identified by the namespace they care about, and possibly part or all of the SOAP body:
  • e.g., the XML elements in the header from the WS-RM namespace
• They return a modified SOAP header and body to the next handler in the chain (a toy sketch follows).

[Diagram: a SOAP message (Header + Body) passes through a WS-RM handler, then further "WS-..." handlers, where the "..." could be WS-Eventing, WS-Transfer, ...]
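A toy sketch of the handler chain, with the SOAP message as a plain dictionary and made-up namespace keys; real handlers would operate on the XML infoset.

```python
# Each "system service" handler consumes the header entries in the namespace
# it cares about and passes the modified message on down the chain.
def ws_rm_handler(message):
    """Consume WS-RM header elements (e.g., a sequence number) and ack them."""
    seq = message["header"].pop("ws-rm:Sequence", None)
    if seq is not None:
        print(f"WS-RM handler acknowledged sequence {seq}")
    return message

def ws_security_handler(message):
    """Consume a (toy) security token before the application sees the body."""
    message["header"].pop("ws-sec:Token", None)
    return message

def application(message):
    print("application logic sees body:", message["body"])

message = {
    "header": {"ws-rm:Sequence": 7, "ws-sec:Token": "abc123"},
    "body": "<SensorReading>3.14</SensorReading>",
}
for handler in (ws_rm_handler, ws_security_handler):
    message = handler(message)
application(message)
```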
Fast Web Service Communication I

• Internet messaging systems allow one to optimize message streams at the cost of "startup time".
• Web Services can deliver the fastest possible interconnections, with or without reliable messaging.
• Typical results from Grossman (UIC) comparing slow SOAP over TCP with binary and UDP transport (the latter gains a factor of roughly 1000):

Record Count | SOAP/XML (pure SOAP)  | WS-DMX/ASCII (SOAP over UDP) | WS-DMX/Binary (binary over UDP)
             | MB     µ      σ/µ     | MB     µ      σ/µ            | MB     µ      σ/µ
10000        | 0.93   2.04   6.45%   | 0.5    1.47   0.61%          | 0.28   1.45   0.38%
50000        | 4.65   8.21   1.57%   | 2.4    1.79   0.50%          | 1.4    1.63   0.27%
150000       | 13.9   26.4   0.30%   | 7.2    2.09   0.62%          | 4.2    1.94   0.85%
375000       | 34.9   75.4   0.25%   | 18     3.08   0.29%          | 10.5   2.11   1.11%
1000000      | 93     278    0.11%   | 48     3.88   1.73%          | 28     3.32   0.25%
5000000      | 465    7020   2.23%   | 242    8.45   6.92%          | 140    5.60   8.12%

(MB = data volume transferred; µ = mean transfer time; σ/µ = relative standard deviation)
Fast Web Service Communication II

• The mechanism only works for streams -- sets of related messages.
• The SOAP header in a stream is constant except for the sequence number (Message ID), time-stamp, ...
• One needs two types of new Web Service specification:
  • "WS-StreamNegotiation", to define how one can use WS-Policy to send messages at the start of a stream that define the methodology for treating the remaining messages in the stream
  • "WS-FlexibleRepresentation", to define new encodings of messages
Fast Web Service Communication III

• Then use "WS-StreamNegotiation" to negotiate the stream in Tortoise SOAP -- ASCII XML over HTTP and TCP:
  • Deposit the basic SOAP header through the connection -- it is part of the context for the stream (the linking of 2 services)
  • Agree on firewall penetration, reliability mechanism, binary representation, and fast transport protocol
  • Naturally transport UDP plus WS-RM
• Use "WS-FlexibleRepresentation" to define the encoding of a fast transport (on a different port), with messages just having a "FlexibleRepresentationContextToken", sequence number, and time stamp if needed (a sketch of such a packet follows):
  • RTP packets have essentially this structure
  • Could add stream termination status
• One can monitor and control the stream with the original negotiation stream.
• One can generate different streams optimized for different end-points.
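A sketch of what one negotiated fast-transport message might look like. Since these specifications were proposals, the format is hypothetical: a 4-byte context token, 4-byte sequence number, and 8-byte microsecond timestamp, RTP-style; the field sizes are illustrative assumptions.

```python
# Pack/unpack a compact stream message: context token + sequence number +
# timestamp, followed by the raw binary payload.
import struct
import time

HEADER_FORMAT = "!IIQ"   # token (4 bytes), sequence (4 bytes), timestamp us (8 bytes)

def pack_stream_message(context_token: int, seq: int, payload: bytes) -> bytes:
    timestamp_us = int(time.time() * 1_000_000)
    return struct.pack(HEADER_FORMAT, context_token, seq, timestamp_us) + payload

def unpack_stream_message(message: bytes):
    header_size = struct.calcsize(HEADER_FORMAT)
    token, seq, ts = struct.unpack(HEADER_FORMAT, message[:header_size])
    return token, seq, ts, message[header_size:]

msg = pack_stream_message(context_token=0xCAFE, seq=1, payload=b"binary frame data")
print(unpack_stream_message(msg))
```

The bulky SOAP context lives in the negotiation stream; each data message carries only these few bytes of header.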
Data Deluged Science

• In the past, we worried about data in the form of parallel I/O or MPI-IO, but we didn't consider it as an enabler of new algorithms and new ways of computing.
• Data assimilation was not central to HPCC.
• DoE ASC was set up because they didn't want test data!
• Now particle physics will get 100 petabytes from CERN:
  • Nuclear physics (Jefferson Lab) is in the same situation
  • They will use around 30,000 CPUs simultaneously, 24x7
• Weather, climate, solid earth (EarthScope)
• Bioinformatics curated databases (biocomplexity has only 1000's of data points at present)
• Virtual Observatory and SkyServer in astronomy
• Environmental sensor nets
Weather Requirements
Data Deluged Science Computing Paradigm

[Diagram with elements: Data, Data Assimilation, Information, Simulation, Informatics, Model, Ideas, Computational Science, Datamining, Reasoning]
Virtual Observatory Astronomy Grid: Integrate Experiments

[Images: Radio, Far-Infrared, Visible, Dust Map, Visible + X-ray, Galaxy Density Map]
DAME: Data Deluged Engineering

[Diagram: in-flight data from ~5000 engines, at ~1 gigabyte per aircraft per engine per transatlantic flight, flows over an airline global network (such as SITA) to a ground station, and on to an Engine Health (Data) Center and a Maintenance Centre connected by Internet, e-mail, and pager]

Rolls Royce and the UK e-Science Program: Distributed Aircraft Maintenance Environment
USArray Seismic Sensors

[Image: USArray seismic sensor deployment]
[Diagram panels with labels: Site-specific Irregular Scalar Measurements; Constellations for Plate Boundary-Scale Vector Measurements; Ice Sheets (Greenland); Volcanoes (Long Valley, CA); PBO; Topography (1 km); Stress Change (Northridge, CA); Earthquakes (Hector Mine, CA)]
Data Deluged Science Computing Architecture

[Diagram: OGSA-DAI Grid Services and Grid Data Assimilation feed HPC Simulation, with Analysis, Control, and Visualize stages; distributed filters massage the data for simulation]
Data Assimilation

• Data assimilation implies one is solving some optimization problem, which might have a Kalman-filter-like structure:

$$\min_{\text{Theoretical Unknowns}} \;\; \sum_{i=1}^{N_{\mathrm{obs}}} \frac{\left(\mathrm{Data}_i(\mathrm{position}, \mathrm{time}) - \mathrm{Simulated\_Value}_i\right)^2}{\mathrm{Error}_i^{\,2}}$$

• Due to the data deluge, one will become more and more dominated by the data (N_obs much larger than the number of simulation points).
• The natural approach is to form, for each local (position, time) patch, the "important" data combinations, so that the optimization doesn't waste time on large-error or insensitive data.
• Data reduction is done in a naturally distributed fashion, NOT on the HPC machine, as distributed computing is most cost-effective if the calculations are essentially independent (a small numerical sketch of the objective follows):
  • Filter functions must be transmitted from the HPC machine
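A small numerical sketch of the objective above using NumPy; all arrays are synthetic stand-ins for the observations, simulated values, and per-observation errors.

```python
# Chi-squared style misfit: sum of squared, error-weighted residuals.
import numpy as np

rng = np.random.default_rng(0)
n_obs = 10_000
data = rng.normal(size=n_obs)                          # Data_i(position, time)
simulated = data + rng.normal(scale=0.1, size=n_obs)   # Simulated_Value_i
error = np.full(n_obs, 0.1)                            # Error_i per observation

def objective(data, simulated, error):
    """The weighted least-squares misfit the assimilation minimizes."""
    return np.sum(((data - simulated) / error) ** 2)

print(f"misfit over {n_obs} observations: {objective(data, simulated, error):.1f}")
```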
Distributed Filtering

N_obs(local patch) >> N_filtered(local patch) ≈ Number_of_Unknowns(local patch)

In the simplest approach, the filtered data are obtained by linear transformations on the original data, based on a Singular Value Decomposition of the least-squares matrix (see the sketch below).

[Diagram: geographically distributed sensor patches each hold N_obs(local patch) observations; the HPC machine factorizes the matrix into a product over local patches and sends the needed filters to a distributed machine near the sensors, which returns N_filtered(local patch) values per patch]
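A sketch of the SVD-based reduction for one local patch, under the assumption that the filter is the leading left singular vectors of the patch's least-squares (design) matrix; the matrices are synthetic.

```python
# Project n_obs observations onto the few "important" combinations so only
# n_filtered numbers need to travel from the sensors to the HPC machine.
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_unknowns = 5_000, 10          # Nobs >> number of unknowns
design = rng.normal(size=(n_obs, n_unknowns))   # least-squares matrix for the patch
observations = rng.normal(size=n_obs)

# Filter = leading left singular vectors; computed once on the HPC machine,
# then shipped to the distributed machine near the sensors ("needed filter").
u, s, vt = np.linalg.svd(design, full_matrices=False)
n_filtered = n_unknowns                # keep one combination per unknown
filter_matrix = u[:, :n_filtered]      # shape (n_obs, n_filtered)

# Near the data: reduce n_obs raw values to n_filtered numbers to transmit.
filtered_data = filter_matrix.T @ observations
print(filtered_data.shape)             # (10,) instead of (5000,)
```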