IU ORE-Chem Update
Marlon Pierce, Geoffrey Fox
Indiana University
IU to Lead New US NSF Track 2d $10M Award
See http://www.futuregrid.org for more information.
What We Said We Would Do
• Apply data-centric workflow technologies (Dryad)
– Significant effort
• Install and run triple store
– Done locally.
– Need to do this in Azure.
• Design alternative formats for ORE (JSON, Microformats)
– Nothing to report yet
• Design secure services, compositions, mash-ups
– OAuth piece done.
– Significant effort on social network interfaces
– Nothing to report on ORE-Chem-enabled services yet
• Investigate clouds for ORE-Chem
– Infrastructure and runtime
– Significant effort on virtual data stores, overheads of virtualization.
Layer Cake of IU Activities
• Web 2.0 Research: Security for REST Services
• Cloud Computing: Infrastructure and Runtimes
• Infrastructure: Windows HPC Testbeds
• Hardware Layer
Cloud Infrastructure
• Tempest: HP distributed shared-memory cluster with 768 processor cores and 1.5 TB total memory capacity. The cluster includes 13.7 TB of local spinning disk.
– Tempest can be dynamically reconfigured to act as either a Windows HPC or a Linux cluster.
– Smaller versions: Madrid and Barcelona
• Other machines:
– The IBM iDataPlex system is an IBM e1350 distributed shared-memory cluster with 1024 processor cores and 3 TB total memory capacity.
– Cray XT5m distributed shared-memory cluster with 672 processor cores and 1.3 TB total memory capacity.
– A shared-memory system with at least 480 cores and 640 GB of RAM will also be installed at IU as part of the FutureGrid award.
Data Cloud Infrastructure
Triple Store: Intellidimension
• This has been installed on IU servers.
• We are ready for data.
• Efforts to install this on MS Azure were not successful.
– Inadequate documentation earlier in the year.
– We will revisit this.
Open Elastic Block Store
• Amazon EBS is a way to mount virtual disks in cloud space.
– Empty disk space or archived data stores
– ORE-CHEM-enabled data sets, for example
– Clone-able, so you can keep your own version of community data
• We are implementing an open version of this.
– Contributing to Nimbus, an open-source EC2-style cloud toolkit
– But independent of Xen, etc.
– Would be interesting to do this for Windows
• Eventual backbone: IU has over a petabyte of Lustre file system disk space.
– Can be used to load and store VMs.
• X. Gao won the best student poster award at TG09.
– Paper accepted to e-Science 2009
Block Store Architecture
[Diagram] The Volume Server exports volumes over iSCSI as virtual block devices (VBDs). A Volume Delegate on the Volume Server handles Create Volume, Export Volume, Create Snapshot, etc.; a VMM Delegate on the Virtual Machine Manager (Xen Dom 0) handles Import Volume, Attach Device, Detach Device, etc. Both are driven by the VBS Web Service, which the VBS Client calls; an attached volume appears as a block device inside the VM instance (Xen Dom U).
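As a reading aid for the diagram, here is a hypothetical client-side call sequence; the IVbsService interface and its method names are illustrative stand-ins, not the actual VBS Web Service interface.

// Illustrative only: the call sequence implied by the diagram above.
// IVbsService and its signatures are hypothetical, not the real VBS WSDL.
public interface IVbsService
{
    string CreateVolume(int sizeInGB);                 // Volume Delegate side
    void ExportVolume(string volumeId);                //   (iSCSI export)
    void CreateSnapshot(string volumeId);
    void ImportVolume(string volumeId);                // VMM Delegate side (Xen Dom 0)
    void AttachDevice(string volumeId, string vmInstanceId, string device);
    void DetachDevice(string volumeId, string vmInstanceId);
}

public static class VbsUsageSketch
{
    public static void AttachCommunityDataSet(IVbsService vbs, string vmInstanceId)
    {
        // Create a volume (empty, or cloned from a snapshot of an ORE-CHEM
        // data set) and export it over iSCSI.
        string volumeId = vbs.CreateVolume(20);
        vbs.ExportVolume(volumeId);

        // Import the iSCSI target on the Xen Dom 0 host and attach it as a
        // virtual block device (VBD) to the running Dom U instance.
        vbs.ImportVolume(volumeId);
        vbs.AttachDevice(volumeId, vmInstanceId, "xvdb");

        // Later, snapshot the modified volume so others can clone it.
        vbs.CreateSnapshot(volumeId);
    }
}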
Integration with Cloud Computing Systems
[Diagram] For integration with Nimbus, a VBS_Nimbus Web service sits in front of the VBS Web Service. The VBS Client issues "attach-volume <volId> <Nimbus Instance Id> <device>"; VBS_Nimbus queries the Nimbus Workspace Service for the Xen Dom 0 host and Dom U id corresponding to the Nimbus instance id, and the VBS Web Service then drives the Xen Delegate (Dom 0: Import Volume, Attach Device, Detach Device, etc.) and the Volume Delegate on the Volume Server (Create Volume, Export Volume, Create Snapshot, etc.) to export the volume over iSCSI and attach it as a VBD to the Xen Dom U instance.
Programming Clouds
Multicore and Cloud Technologies to Support Data-Intensive Applications
• Using Dryad (Microsoft) and MPI to study the structure of gene sequences on the Tempest cluster. We are working on PubChem.
• See http://www.infomall.org/salsa for lab projects (X. Qiu).
(Figure courtesy of Jong Y. Choi)
The PubChem dataset consists of binary 166-bit MACCS keys (fingerprints), which indicate whether or not each chemical compound contains a particular functional feature.
We have 26,466,421 chemical compounds in total (i.e., the full PubChem dataset has 166 dimensions and about 26M records).
We randomly selected 50K chemicals to produce a 3D GTM map. GTM (Generative Topographic Mapping) is an algorithm that finds a lower-dimensional structure (3D in this case) in higher-dimensional data.
http://www.youtube.com/watch?v=nylgjKgnSLg
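To make the data layout concrete, here is a minimal sketch (not the SALSA group's code) of how the 166-bit fingerprints and the random 50K sample could be represented; the loader and the seed handling are placeholders.

// Sketch only: each compound is a 166-bit MACCS fingerprint; GTM sees a
// random sample of 50,000 of them.
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

static class FingerprintSample
{
    const int MaccsKeys = 166;     // dimensions per compound
    const int SampleSize = 50000;  // compounds fed to the 3D GTM run

    static List<BitArray> Sample(List<BitArray> fingerprints, int seed)
    {
        if (fingerprints.Any(f => f.Length != MaccsKeys))
            throw new ArgumentException("every fingerprint must have 166 MACCS keys");

        // Shuffle the indices and keep the first 50K fingerprints.
        var rand = new Random(seed);
        return Enumerable.Range(0, fingerprints.Count)
                         .OrderBy(_ => rand.Next())
                         .Take(SampleSize)
                         .Select(i => fingerprints[i])
                         .ToList();
        // The 3D GTM projection itself is produced by the SALSA tools, not here.
    }
}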
IU's ORE-CHEM Pipeline
1. Harvest NIH PubChem for 3D structures
2. Convert PubChem XML to CML
3. Convert CML to Gaussian input
4. Submit jobs to the TeraGrid with Swarm
5. Convert Gaussian output to CML
6. Convert CML to RDF (ORE-Chem)
7. Insert RDF into the RDF triple store
The goal is to create a public, searchable triple store populated with ORE-CHEM data on drug-like molecules.
Conversions are done with the Jumbo/CML tools from Peter Murray-Rust's group at Cambridge. Swarm is a Web service capable of managing tens of thousands of jobs on the TeraGrid. We are developing a Dryad version of the pipeline.
Iterative MapReduce: Kmeans Clustering and Matrix Multiplication
Overhead of parallel runtimes – Matrix Multiplication
• Compute-intensive application, O(n^3)
• Higher data transfer requirements, O(n^2)
• CGL-MapReduce shows minimal overheads next to MPI
[Figure: Iterative MapReduce algorithm for Matrix Multiplication]
Overhead of parallel runtimes – Kmeans Clustering
[Figure: Kmeans Clustering implemented as an iterative MapReduce application]
• O(n) calculations in each iteration
• Small data transfer requirements, O(1)
• With large data sets, CGL-MapReduce shows negligible overheads
• Much higher overheads in Hadoop and Dryad
Jaliya Ekanayake {[email protected]}
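The Kmeans case can be read as a loop of map and reduce phases in which only the centroids move between iterations. A minimal conceptual sketch in plain C# (not the CGL-MapReduce API) follows; because only the K centroids are exchanged per iteration, a runtime that keeps the points resident between iterations avoids the per-iteration reloading that penalizes Hadoop and Dryad here.

// Conceptual sketch of Kmeans as iterative MapReduce (plain C#, not the
// CGL-MapReduce API). Map: assign each point to its nearest centroid.
// Reduce: average the points assigned to each centroid. Iterate until
// the centroids stop moving.
using System;
using System.Collections.Generic;
using System.Linq;

static class KmeansSketch
{
    public static double[][] Run(double[][] points, double[][] centroids, double tol)
    {
        while (true)
        {
            // "Map": key each point by the index of its nearest centroid.
            var groups = points.GroupBy(p => NearestCentroid(p, centroids));

            // "Reduce": new centroid = mean of its assigned points
            // (centroids with no points keep their old position).
            var updated = (double[][])centroids.Clone();
            foreach (var g in groups)
                updated[g.Key] = Mean(g);

            // Only the centroids move between iterations: O(1) data transfer.
            double shift = centroids.Zip(updated, Distance).Max();
            centroids = updated;
            if (shift < tol) return centroids;
        }
    }

    static int NearestCentroid(double[] p, double[][] centroids) =>
        Enumerable.Range(0, centroids.Length)
                  .OrderBy(i => Distance(p, centroids[i]))
                  .First();

    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

    static double[] Mean(IEnumerable<double[]> pts)
    {
        var list = pts.ToList();
        var m = new double[list[0].Length];
        foreach (var p in list)
            for (int d = 0; d < m.Length; d++) m[d] += p[d];
        for (int d = 0; d < m.Length; d++) m[d] /= list.Count;
        return m;
    }
}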
High Performance Parallel Computing on Clouds
• Performance of MPI on virtualized resources
– Evaluated using a dedicated private cloud infrastructure
– Exactly the same hardware and software configurations on bare-metal and virtual nodes
– Applications with different communication-to-computation ratios
– Different virtual machine (VM) allocation strategies (1 VM per node to 8 VMs per node)
Performance of matrix multiplication under different VM configurations
• O(n^2) communication (n = dimension of the matrix)
• More susceptible to bandwidth than latency
• Minimal overheads under virtualized resources
Overhead under different VM configurations for the Concurrent Wave Equation Solver
• O(1) communication (smaller messages)
• More susceptible to latency
• Higher overheads under virtualized resources
Jaliya Ekanayake {[email protected]}
Conclusions: Dryad for Scientific Computing
• Investigated several applications with various computation, communication, and data access requirements
• All DryadLINQ applications work, and in many cases perform better than Hadoop
• We can definitely use DryadLINQ (and Hadoop) for scientific analyses
• We did not implement (or find) applications that can only be implemented using DryadLINQ but not with typical MapReduce
• The current release of DryadLINQ has some performance limitations
• DryadLINQ hides many aspects of parallel computing from the user
• Coding is much simpler in DryadLINQ than Hadoop (provided that the performance issues are fixed)
• A key issue is support for inhomogeneous data
IU's ORE-CHEM Pipeline
1. Harvest NIH PubChem for 3D structures
2. Convert PubChem XML to CML
3. Convert CML to Gaussian input
4. Submit jobs to the TeraGrid with Swarm
5. Convert Gaussian output to CML
6. Convert CML to RDF (ORE-Chem)
7. Insert RDF into the RDF triple store
The goal is to create a public, searchable triple store populated with ORE-CHEM data on drug-like molecules.
Conversions are done with the Jumbo/CML tools from Peter Murray-Rust's group at Cambridge. Swarm is a Web service capable of managing tens of thousands of jobs on the TeraGrid. We are developing a Dryad version of the pipeline.
Architecture of Swarm Service
[Diagram] Swarm-Analysis exposes a standard Web service interface and a Large Task Load Optimizer, which dispatches work through three connectors backed by a local RDBMS: the Swarm-Grid connector (Grid HPC / Condor clusters), the Swarm-Dryad connector (Windows Server clusters), and the Swarm-Hadoop connector (cloud compute clusters).
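Purely as a conceptual reading of the dispatch shown above (this is not Swarm's code or interfaces), the load optimizer's choice among the three connectors can be pictured as a routing function; the criteria echo the Swarm-Grid slide that follows.

// Conceptual sketch only: route a job to one of the three Swarm connectors
// (Grid/Condor, Dryad on Windows HPC, Hadoop) based on its characteristics.
// Not the actual Swarm code; types and fields here are hypothetical.
public enum SwarmBackend { Grid, Dryad, Hadoop }

public class JobDescription
{
    public bool IsMpi;            // parallel (e.g., MPI) job
    public bool IsLongRunning;    // long wall-clock time
    public bool NeedsWindows;     // requires a Windows HPC node (e.g., a DryadLINQ stage)
}

public static class LoadOptimizerSketch
{
    public static SwarmBackend Route(JobDescription job)
    {
        if (job.NeedsWindows) return SwarmBackend.Dryad;              // Windows Server cluster
        if (job.IsMpi || job.IsLongRunning) return SwarmBackend.Grid; // Grid HPC / Condor
        return SwarmBackend.Hadoop;                                   // cloud compute cluster
    }
}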
Swarm-Grid
• Swarm considers traditional Grid HPC clusters suitable for high-throughput jobs:
– Parallel jobs (e.g., MPI jobs)
– Long-running jobs
• Resource Ranking Manager
– Prioritizes the resources with QBETS and INCA (the QBETS Web service is hosted by UCSB)
• Fault Manager
– Fatal faults
– Recoverable faults
[Diagram] Behind the standard Web service interface, a Request Manager feeds a Job Queue and a Job Distributor; per-user job boards and a Data Model Manager are backed by the local RDBMS; credentials come from a MyProxy server hosted by the TeraGrid project; and the Grid HPC/Condor pool resource connector submits work via Condor (Grid/Vanilla) with Birdbath to Grid HPC clusters and a Condor cluster.
Some Details
• We can submit jobs to three different TeraGrid machines:
– Abe, Mercury, Cobalt (all at NCSA)
– IU's Big Red has some technical problems
• We can do about 100-200 molecules per day in tests.
• The approach is fragile because application and system administrators tend to change things every few months.
– They don't validate Globus invocations of the codes.
Dryad Data Partitioning
• Two methods:
– Manually place the files on every node, or
– Write C# code that uses DryadLINQ partitioning operators like HashPartition<T,K> or RangePartition<T,K> (see the sketch after this list)
• A partitioned data set consists of two types of files:
– A metadata file (.pt extension) describing the partitions, for example:
\DryadData\UserName\InputData   (file path and name)
4   (number of partitions, depending on the number of nodes available)
0,2000,NODE01   (partition lines: partition number, size in bytes, node name[:file path])
1,2000,NODE02,NODE03:FilePath
2,2000,NODE03,NODE04
3,2000,NODE04
– A set of partition files, one for each data partition.
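A minimal sketch of the second method, assuming the HashPartition<T,K> operator takes a key selector and a partition count and that ToPartitionedTable writes the result back out, as described in the DryadLINQ programming guide; inputUri and outputUri are placeholder paths.

// Sketch of programmatic partitioning (assumed DryadLINQ operator signatures;
// inputUri and outputUri are placeholders).
IQueryable<LineRecord> records = PartitionedTable.Get<LineRecord>(inputUri);

// Hash-partition the records into 4 partitions (matching the 4 nodes in the
// metadata example above), keyed on the record text.
var partitioned = records.HashPartition(r => r.line, 4);

// Materialize the partitioned table (writes the .pt metadata file and the
// per-partition files).
partitioned.ToPartitionedTable(outputUri);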
Programming the Pipeline
• IQueryable<T> represents a query over the data.
• Input data is represented by a PartitionedTable<T> object.
• DryadLINQ programs apply LINQ query operations to PartitionedTable<T> objects.
• LINQ queries on the PartitionedTable object are executed on the cluster.

IQueryable<LineRecord> filenames =
    PartitionedTable.Get<LineRecord>(filepathuri);
IQueryable<outputinfo> outputs = filenames.Select(s =>
    XML2CML(execFileName, programSwitches, s.line, outputDirectory));

• Jobs will be executed on different nodes, and the output will be collected in outputDirectory.
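The XML2CML call above shells out to the Jumbo converter (java.exe plus jumboconverters.jar, shown later in the deck). A hedged sketch of what such a helper might look like follows; the outputinfo fields, argument names, and the exact converter switches are placeholders.

// Hypothetical helper behind the XML2CML call above: run the Jumbo converter
// on one input file and report where the output landed. Not the production code.
using System.Diagnostics;
using System.IO;

public struct outputinfo
{
    public string InputFile;
    public string OutputFile;
    public int ExitCode;
}

public static class Converters
{
    public static outputinfo XML2CML(string execFileName, string programSwitches,
                                     string inputFile, string outputDirectory)
    {
        string outputFile = Path.Combine(outputDirectory,
            Path.GetFileNameWithoutExtension(inputFile) + ".cml");

        // e.g. execFileName = "java.exe",
        //      programSwitches = "-jar jumboconverters.jar ..." (exact switches not shown here)
        var psi = new ProcessStartInfo
        {
            FileName = execFileName,
            Arguments = programSwitches + " " + inputFile + " " + outputFile,
            UseShellExecute = false
        };

        using (var p = Process.Start(psi))
        {
            p.WaitForExit();
            return new outputinfo { InputFile = inputFile, OutputFile = outputFile, ExitCode = p.ExitCode };
        }
    }
}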
Security in Web 2.0
OAuth: REST Security
• This is actually a Year 2 deliverable, but we made progress in Year 1.
• OAuth is essentially security for REST.
– Provides authentication and authorization
– Relevant to ORE-CHEM services
• Use the REST URL and HTTP method:
– Resources are identified by URLs
– Access privileges are identified by HTTP methods (GET, POST, PUT, DELETE)
• Extend OAuth:
– Add finer-grained authorization information in requests and responses (illustrated below).
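As a rough illustration of the URL-plus-HTTP-method model (this is not the OGCE OAuth code itself; all names here are hypothetical), an authorization check keyed on resource URL and method might look like the following.

// Illustrative only: authorize a REST request by its resource URL and HTTP method,
// after OAuth has authenticated the caller. Names here are hypothetical.
using System;
using System.Collections.Generic;

public class RestAuthorizer
{
    // (consumer key, URL prefix) -> allowed HTTP methods
    private readonly Dictionary<(string, string), HashSet<string>> grants =
        new Dictionary<(string, string), HashSet<string>>();

    public void Grant(string consumerKey, string urlPrefix, params string[] methods) =>
        grants[(consumerKey, urlPrefix)] =
            new HashSet<string>(methods, StringComparer.OrdinalIgnoreCase);

    public bool IsAuthorized(string consumerKey, string url, string method)
    {
        foreach (var entry in grants)
        {
            var (key, prefix) = entry.Key;
            if (key == consumerKey
                && url.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
                && entry.Value.Contains(method))
                return true;
        }
        return false;
    }
}

// Usage: a consumer allowed to read but not modify an ORE-Chem resource.
// var auth = new RestAuthorizer();
// auth.Grant("consumer-key-123", "https://example.org/orechem/aggregations/", "GET");
// bool ok = auth.IsAuthorized("consumer-key-123",
//     "https://example.org/orechem/aggregations/42", "POST"); // false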
OAuth Security Status
• The OAuth *Core* code provides the fundamental piece of the OAuth 1.0 specification.
• Includes a minimal webapp example:
– The sample web apps just support shared secrets.
– We extended them to support PKI.
– We also fixed some bugs in the code.
– To support OAuth extensions, more code is needed in OAuth Core.
• For OpenID, we use the OpenID4Java library, and it seems to offer enough functionality so far.
• Tutorial given at TeraGrid09
– Slides: http://www.collabogce.org/ogce/images/3/39/OAuthOverview-TG09.ppt
– Code: http://ogce.svn.sourceforge.net/viewvc/ogce/incubator/OGCEOAuth/
Acknowledgments
• Geoffrey Fox
• Judy Qiu and SALSA team: data mining
– www.infomall.org/salsa
• Jaliya Ekanayake: Dryad and cloud performance
• Sangmi Pallickara: Swarm service
• Xiaoming Gao: Virtual Block Store
• Zhenhua Guo: OAuth, OpenID, and social networks
Full Stop
Dryad and DryadLINQ
Dryad is a high-performance, general-purpose distributed computing engine that simplifies the task of implementing distributed applications on clusters of computers running a Windows operating system.
DryadLINQ allows us to implement Dryad applications in managed code by using an extended version of the LINQ programming model and API. LINQ was introduced with Microsoft .NET Framework version 3.5.
The DryadLINQ provider translates the application's LINQ queries into a Dryad job and runs the job as a distributed application on a Windows HPC cluster.
Source: Dryad and DryadLINQ: An Introduction, version 1.0, June 30, 2009, Microsoft Research.
[Diagram] A DryadLINQ application produces a basic execution plan, which becomes a distributed job: processing code (vertices, "S") runs over the data partitions on the HPC cluster and writes the output data. (Dryad and DryadLINQ: An Introduction, version 1.0, June 30, 2009, Microsoft Research)
[Diagram] The client workstation runs the DryadLINQ provider and submits to the Dryad cluster: the head node hosts the Dryad job manager (JM), and vertices (V) run on the compute nodes. (DryadLINQ Programming Guide, version 1.0, June 30, 2009, Microsoft Research)
Client workstation: runs the DryadLINQ application.
DryadLINQ provider: creates a Windows HPC job on the cluster to handle the Dryad processing, receives the results, and returns them to the application.
Job manager: the Windows HPC task that manages execution of the associated Dryad job on the cluster.
Head node: manages the cluster and hosts the Windows HPC Administration Console and the Dryad management service.
Vertices: perform the data processing on the compute nodes.
[Diagram] Components: the local machine with the DryadLINQ provider (LinqToDryad.dll); java.exe + jumboconverters.jar for the XML -> CML and CML -> Gaussian-input conversions; and the Dryad cluster, over which all of the initial XML files are distributed, the generated Gaussian input files are distributed, and gaussian.exe is run on every file at every node.
[Diagram] Dryad cluster stages: Stage 1, XML to CML conversion; Stage 2, CML to Gaussian conversion; Stage 3, run Gaussian on every file.
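Putting the stages together, here is a hedged sketch of how the three-stage Dryad job could be expressed as chained DryadLINQ queries; CML2Gaussian, RunGaussian, the switch strings, and the URIs are hypothetical names in the style of the XML2CML example earlier.

// Sketch of the three-stage Dryad pipeline. CML2Gaussian and RunGaussian are
// hypothetical helpers shaped like XML2CML; URIs and switches are placeholders.
IQueryable<LineRecord> xmlFiles = PartitionedTable.Get<LineRecord>(xmlListUri);

// Stage 1: PubChem XML -> CML on every node holding a partition.
var cmlFiles = xmlFiles.Select(s =>
    XML2CML(javaExe, xmlToCmlSwitches, s.line, cmlDir));

// Stage 2: CML -> Gaussian input files.
var gaussianInputs = cmlFiles.Select(c =>
    CML2Gaussian(javaExe, cmlToGaussianSwitches, c.OutputFile, gaussianDir));

// Stage 3: run gaussian.exe on every generated input file.
var results = gaussianInputs.Select(g =>
    RunGaussian(gaussianExe, g.OutputFile, resultsDir));

// Forcing evaluation submits the whole chain to the cluster as one Dryad job.
int completed = results.Count();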
Drilling Through Data Clouds
• Applications
– Cheminformatics: mapping PubChem data into low dimensions to aid drug discovery
– Biology: Expressed Sequence Tag (EST) sequence assembly (CAP3)
– Biology: pairwise Alu sequence alignment (SW)
– Health: correlating childhood obesity with environmental factors
• Data mining algorithms and visualization: clustering (pairwise, vector), MDS, GTM, PCA, CCA; PlotViz
• Runtimes: cloud technologies (MapReduce, Dryad, Hadoop) and classic HPC (MPI)
• Infrastructure: FutureGrid / VMs / virtual storage, and bare metal (compute, network, storage)
[Layer diagram: Architecture and Performance of Runtime Environments for Data Intensive Scalable Computing]
CAP3 – Gene Assembly Program
• Compute-intensive application
• Embarrassingly parallel operation
• All runtimes perform equally well
Measured using 32 compute nodes, each with 8 cores and 16 GB of memory.
[Chart: data/compute-intensive applications implemented as MapReduce "filters"; x-axis: number of reads processed]
High Energy Physics Data Analysis
• Data-intensive application
• MapReduce-style parallel operation
• Both runtimes perform comparably well
[Figure: Architecture of CGL-MapReduce]
Jaliya Ekanayake {[email protected]}
[Chart: Job execution time in Swarm-Dryad with Windows HPC, 16 nodes]
[Chart: Job execution time in Swarm-Dryad with varying numbers of nodes]