Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data System Orchestration / Dataflow / Workflow Archival Storage – NOSQL like.


Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data
System Orchestration / Dataflow / Workflow
Archival Storage – NOSQL like Hbase
Batch Processing (Iterative MapReduce)
Raw Data → Data → Information → Knowledge → Wisdom → Decisions
Streaming Processing (Iterative MapReduce)
Storm (six instances in the diagram), fed by Pub-Sub
Internet of Things (Smart Grid)
System Orchestration / Dataflow / Workflow
Archival Storage – Accumulo
Batch Processing (MapReduce)
Raw Data → Data → Information → Knowledge → Wisdom → Decisions
Streaming Processing (Bolts)
Storm (six instances in the diagram), fed by Pub-Sub
Data Ingest
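The idea in the DIKW diagrams, the same data-to-decision stages served by both an archival batch path and a pub-sub streaming path, can be sketched in plain Python. This is only an illustration under stated assumptions: the "sensor,value" record format, the stage functions `to_information`, `to_knowledge` and `decide`, and the threshold are all hypothetical, and no Storm, Hbase or Accumulo API is used.

```python
from collections import deque
from statistics import mean


def to_information(raw):
    """Raw data -> information: parse a 'sensor,value' record (hypothetical format)."""
    sensor, value = raw.split(",")
    return sensor, float(value)


def to_knowledge(readings):
    """Information -> knowledge: average value over a set of readings."""
    return mean(value for _, value in readings)


def decide(average, threshold=50.0):
    """Knowledge -> decision: flag a high average (threshold is an assumption)."""
    return "alert" if average > threshold else "ok"


def batch_path(archive):
    """Batch processing: one pass over everything held in archival storage."""
    return decide(to_knowledge([to_information(r) for r in archive]))


def streaming_path(events, window_size=3):
    """Streaming processing: each event updates a sliding window, bolt style."""
    window = deque(maxlen=window_size)
    for raw in events:  # 'events' stands in for a pub-sub feed
        window.append(to_information(raw))
        yield decide(to_knowledge(window))


archive = ["s1,40", "s1,45", "s1,70", "s1,80"]
batch_decision = batch_path(archive)
stream_decisions = list(streaming_path(iter(archive)))
print(batch_decision, stream_decisions)
```

Both paths reuse the same stage functions; only the way data arrives differs, which is the integration point the slide highlights.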
Integrated Software Ecosystem
Layer | Big Data | HPC
Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus
Libraries | MLlib/Mahout, R, Python | Matlab, Eclipse, Apps
High Level Programming | Pig, Hive, Drill | Domain-specific Languages
Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, SQL, SPARQL | Fortran, C/C++
Streaming | Storm, Kafka, Kinesis | -
Parallel Runtime | MapReduce | MPI/OpenMP/OpenCL
Coordination | Zookeeper | -
Caching | Memcached | -
Data Management | Hbase, Neo4J | iRODS
Data Transfer | Sqoop | GridFTP
Scheduling | Yarn | Slurm
File Systems | HDFS | Lustre
Formats | Thrift, Protobuf | FITS, HDF
HPC-ABDS Integrated Software
Layer | Big Data ABDS | HPC, Cluster
Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus
Libraries | MLlib/Mahout, R, Python | Matlab, Eclipse, Apps
High Level Programming | Pig, Hive, Drill | Domain-specific Languages
Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, SQL, SPARQL | Fortran, C/C++
Streaming | Storm, Kafka, Kinesis | -
Parallel Runtime | MapReduce | MPI/OpenMP/OpenCL
Coordination | Zookeeper | -
Caching | Memcached | -
Data Management | Hbase, Neo4J, MySQL | iRODS
Data Transfer | Sqoop | GridFTP
Scheduling | Yarn | Slurm
File Systems | HDFS, Object Stores | Lustre
Formats | Thrift, Protobuf | FITS, HDF
Virtualization | OpenStack | Docker, SR-IOV
Infrastructure | CLOUDS | SUPERCOMPUTERS
HPC-ABDS Integrated Software
Layer | Big Data ABDS | HPC, Cluster
17. Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus, Taverna
16. Libraries | MLlib/Mahout, R, Python | ScaLAPACK, PETSc, Matlab
15A. High Level Programming | Pig, Hive, Drill | Domain-specific Languages
15B. Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, Scala, Clojure, SQL, SPARQL, Python | Fortran, C/C++, Python
14B. Streaming | Storm, Kafka, Kinesis | -
13,14A. Parallel Runtime | Hadoop, MapReduce | MPI/OpenMP/OpenCL, CUDA, Exascale Runtime
2. Coordination | Zookeeper | -
12. Caching | Memcached | -
11. Data Management | Hbase, Accumulo, Neo4J, MySQL | iRODS
10. Data Transfer | Sqoop | GridFTP
9. Scheduling | Yarn | Slurm
8. File Systems | HDFS, Object Stores | Lustre
1, 11A Formats | Thrift, Protobuf | FITS, HDF
5. IaaS | OpenStack, Docker | Linux, Bare-metal, SR-IOV
Infrastructure | CLOUDS | SUPERCOMPUTERS
Initial Convergence Software
Layer | Big Data ABDS | HPC, Cluster
Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus, Taverna
Libraries | MLlib/Mahout, R, Python | ScaLAPACK, PETSc, Matlab
High Level Programming | Pig, Hive, Drill | Domain-specific Languages
Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, Scala, Clojure, SQL, SPARQL, Python | Fortran, C/C++, Python
Streaming | Storm, Kafka, Kinesis | -
Parallel Runtime | Hadoop, MapReduce | MPI/OpenMP/OpenCL, CUDA, Exascale Runtime
Coordination | Zookeeper | -
Caching | Memcached | -
Data Management | Hbase, Accumulo, Neo4J, MySQL | iRODS
Data Transfer | Sqoop | GridFTP
Scheduling | Mesos, Aurora, Yarn | Slurm
File Systems | HDFS, Object Stores | Lustre
Formats | Thrift, Protobuf | FITS, HDF
IaaS | OpenStack, Docker | Linux, Bare-metal, SR-IOV
Infrastructure | CLOUDS | SUPERCOMPUTERS
HPC-ABDS Integrated Software
Layer | Big Data ABDS | HPC, Cluster
17. Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus, Taverna
16. Libraries | MLlib/Mahout, R, Python | ScaLAPACK, PETSc, Matlab
15A. High Level Programming | Pig, Hive, Drill | Domain-specific Languages
15B. Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, Scala, Clojure, SQL, SPARQL, Python | Fortran, C/C++, Python
14B. Streaming | Storm, Kafka, Kinesis | -
13,14A. Parallel Runtime | MapReduce | MPI/OpenMP/OpenCL, CUDA, Exascale Runtime
2. Coordination | Zookeeper | -
12. Caching | Memcached | -
11. Data Management | Hbase, Neo4J, MySQL | iRODS
10. Data Transfer | Sqoop | GridFTP
9. Scheduling | Yarn | Slurm
8. File Systems | HDFS, Object Stores | Lustre
1, 11A Formats | Thrift, Protobuf | FITS, HDF
5. IaaS | OpenStack, Docker | Linux, Bare-metal, SR-IOV
Infrastructure | CLOUDS | SUPERCOMPUTERS
4 Forms of MapReduce
(1) Map Only: BLAST Analysis, Local Machine Learning, Pleasingly Parallel
(2) Classic MapReduce: High Energy Physics (HEP) Histograms, Distributed search, Recommender Engines
(3) Iterative Map Reduce or Map-Collective: Expectation maximization, Clustering e.g. K-means, Linear Algebra, PageRank
(4) Point to Point or Map-Communication: Classic MPI, PDE Solvers and Particle Dynamics, Graph Problems
[Diagram labels: Input, map, reduce, Iterations, Local, Output]
MapReduce and Iterative Extensions (Spark, Twister)
MPI, Giraph
Integrated Systems such as Hadoop + Harp with Compute and Communication model separated
Correspond to first 4 of Identified Architectures
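The first three forms listed above can be illustrated with a small single-process Python sketch. It is not how Hadoop, Spark, Twister or Harp implement them; the helper names `map_only`, `map_reduce` and `kmeans_1d` are hypothetical, and form (4), point-to-point communication, is left out because it needs a message-passing runtime such as MPI.

```python
from collections import defaultdict


def map_only(records, fn):
    """(1) Map Only: independent tasks with no communication between them."""
    return [fn(r) for r in records]


def map_reduce(records, mapper, reducer):
    """(2) Classic MapReduce: map to (key, value), group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}


def kmeans_1d(points, centroids, iterations=10):
    """(3) Iterative MapReduce: repeat a map-reduce step, feeding centroids back in."""
    for _ in range(iterations):
        def assign(p):
            # Map step: emit (index of nearest centroid, point).
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            return [(nearest, p)]

        # Reduce step: average the points assigned to each centroid.
        means = map_reduce(points, assign, lambda vs: sum(vs) / len(vs))
        # Keep the old centroid if no point was assigned to it.
        centroids = [means.get(i, c) for i, c in enumerate(centroids)]
    return centroids


squares = map_only([1, 2, 3], lambda x: x * x)
histogram = map_reduce([0.5, 1.2, 1.7, 2.1], lambda x: [(int(x), 1)], sum)
centers = kmeans_1d([1.0, 2.0, 9.0, 10.0], [0.0, 5.0])
print(squares, histogram, centers)
```

The histogram call mirrors the HEP histogram use case in class (2), and the K-means loop shows why class (3) benefits from runtimes that keep data in memory between iterations.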
6 Data Analysis Architectures
(1) Map Only: BLAST Analysis, Local Machine Learning, Pleasingly Parallel
(2) Classic MapReduce: High Energy Physics (HEP) Histograms, Web search, Recommender Engines
(3) Iterative Map Reduce or Map-Collective: Expectation maximization, Clustering, Linear Algebra, PageRank
(4) Point to Point or Map-Communication: Classic MPI, PDE Solvers and Particle Dynamics, Graph
(5) Map Streaming (maps, brokers, Events): Streaming images from Synchrotron sources, Telescopes, IoT
(6) Shared memory (Map Communicates, Shared Memory; Map & Communicate): Difficult to parallelize asynchronous parallel Graph Algorithms
[Diagram labels: Input, map, reduce, Iterations, Local, Output]
Classic Hadoop in classes 1) 2)
MapReduce and Iterative Extensions (Spark, Twister)
Harp – Enhanced Hadoop
MPI, Giraph
Apache Storm: Maps are Bolts
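Architecture (5), Map Streaming, routes events through brokers to long-running maps. A minimal sketch with Python threads and an in-process queue acting as the broker follows; the function name `run_map_streaming` and the squaring step are placeholders, and a real deployment would use a system such as Apache Storm with Kafka rather than `queue.Queue`.

```python
import queue
import threading


def run_map_streaming(events, num_maps=2):
    """Route events from a broker queue to long-running map workers."""
    broker = queue.Queue()   # stands in for a pub-sub broker
    results = queue.Queue()
    stop = object()          # sentinel telling a worker to shut down

    def map_worker():
        while True:
            event = broker.get()
            if event is stop:
                break
            results.put(event * event)  # placeholder per-event computation

    workers = [threading.Thread(target=map_worker) for _ in range(num_maps)]
    for w in workers:
        w.start()
    for e in events:
        broker.put(e)
    for _ in workers:
        broker.put(stop)     # one sentinel per worker
    for w in workers:
        w.join()

    out = []
    while not results.empty():
        out.append(results.get())
    # Arrival order depends on thread scheduling, so sort for a stable result.
    return sorted(out)


processed = run_map_streaming([1, 2, 3, 4])
print(processed)
```

Unlike the batch forms, the maps here never see the whole input; they stay alive and handle each event as it arrives, which matches the "Maps are Bolts" note on the slide.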
[Figure: Kmeans Clustering. Panels: 1000000 points, 50000 centroids; 10000000 points, 5000 centroids; 100000000 points, 500 centroids. Axes: Time (in sec) and Efficiency versus Number of Cores (24, 48, 96). Series: Hadoop MR, Mahout, Python Scripting, Spark, Harp, MPI.]
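The efficiency axis in the Kmeans figure can be read with the usual definition of parallel efficiency relative to a baseline core count, E(p) = (T(p0) · p0) / (T(p) · p). The sketch below assumes the smallest core count is the baseline, which the slide does not state, and the timings are made-up placeholders, not values read from the plot.

```python
def parallel_efficiency(times_by_cores, base_cores=None):
    """Return {cores: efficiency} relative to a baseline core count.

    times_by_cores maps core count -> wall-clock time in seconds.
    If base_cores is omitted, the smallest core count is used (an assumption).
    """
    if base_cores is None:
        base_cores = min(times_by_cores)
    base_work = times_by_cores[base_cores] * base_cores
    return {
        cores: base_work / (seconds * cores)
        for cores, seconds in sorted(times_by_cores.items())
    }


# Placeholder timings for illustration only; NOT taken from the figure.
placeholder_times = {24: 100.0, 48: 50.0, 96: 40.0}
efficiency = parallel_efficiency(placeholder_times)
print(efficiency)
```

An efficiency of 1.0 means time halves whenever the core count doubles; values below 1.0 indicate communication or scheduling overhead growing with scale.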
Software-Defined Distributed System (SDDS) as a Service includes Dynamic Orchestration and Dataflow
Software (Application Or Usage), SaaS:
• Use HPC-ABDS
• Class Usages e.g. run GPU & multicore
• Applications
• Control Robot
Platform, PaaS:
• Cloud e.g. MapReduce
• HPC e.g. PETSc, SAGA
• Computer Science e.g. Compiler tools, Sensor nets, Monitors
Infrastructure, IaaS:
• Software Defined Computing (virtual Clusters)
• Hypervisor, Bare Metal
• Operating System
Network, NaaS:
• Software Defined Networks
• OpenFlow GENI
SDDS-aaS Tools:
• Provisioning
• Image Management
• IaaS Interoperability
• NaaS, IaaS tools
• Expt management
• Dynamic IaaS NaaS
• DevOps
CloudMesh is an SDDSaaS tool that uses Dynamic Provisioning and Image Management to provide custom environments for general target systems.
It involves (1) creating, (2) deploying, and (3) provisioning one or more images on a set of machines on demand.
http://mycloudmesh.org/