Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data
DIKW pipeline (diagram components)

System Orchestration / Dataflow / Workflow
Archival Storage – NOSQL like Hbase
Batch Processing (Iterative MapReduce)
Raw Data -> Data -> Information -> Knowledge -> Wisdom -> Decisions
Streaming Processing (Iterative MapReduce), drawn as a chain of Storm nodes
Pub-Sub
Internet of Things (Smart Grid)

A second build of the same diagram uses Archival Storage – Accumulo, Batch Processing (MapReduce), Streaming Processing (Bolts), and Pub-Sub fed by Data Ingest.

HPC-ABDS Integrated Software

Layer | Big Data ABDS | HPC, Cluster
17. Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus, Taverna
16. Libraries | MLlib/Mahout, R, Python | ScaLAPACK, PETSc, Matlab
15A. High Level Programming | Pig, Hive, Drill | Domain-specific Languages
15B. Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, Scala, Clojure, SQL, SPARQL, Python | Fortran, C/C++, Python
14B. Streaming | Storm, Kafka, Kinesis |
13, 14A. Parallel Runtime | Hadoop, MapReduce | MPI/OpenMP/OpenCL, CUDA, Exascale Runtime
2. Coordination | Zookeeper |
12. Caching | Memcached |
11. Data Management | Hbase, Accumulo, Neo4J, MySQL | iRODS
10. Data Transfer | Sqoop | GridFTP
9. Scheduling | Yarn | Slurm
8. File Systems | HDFS, Object Stores | Lustre
1, 11A. Formats | Thrift, Protobuf | FITS, HDF
5. IaaS | OpenStack, Docker | Linux, Bare-metal, SR-IOV
Infrastructure | CLOUDS | SUPERCOMPUTERS

Other builds of this slide label the combined stack "Integrated Software Ecosystem" or "Initial Convergence Software". The "Initial Convergence Software" build lists Mesos, Aurora, Yarn under Scheduling. An early build labels the IaaS row "Virtualization" and includes Eclipse and Apps among the Libraries.

4 Forms of MapReduce / 6 Data Analysis Architectures

(1) Map Only: BLAST Analysis, Local Machine Learning, Pleasingly Parallel
(2) Classic MapReduce: High Energy Physics (HEP) Histograms, Distributed search (Web search), Recommender Engines
(3) Iterative Map Reduce or Map-Collective: Expectation maximization, Clustering e.g. K-means, Linear Algebra, PageRank
(4) Point to Point or Map-Communication: Classic MPI, PDE Solvers and Particle Dynamics, Graph Problems
(5) Map Streaming (maps, brokers, events): Streaming images from Synchrotron sources, Telescopes, IoT
(6) Shared memory (Map Communicates through Shared Memory): Difficult to parallelize asynchronous parallel Graph Algorithms

Software: MapReduce and Iterative Extensions (Spark, Twister); MPI, Giraph; Harp – Enhanced Hadoop; Apache Storm, where Maps are Bolts.
Classic Hadoop is in classes (1) and (2).
Integrated Systems such as Hadoop + Harp have the Compute and Communication models separated.
The 4 forms of MapReduce correspond to the first 4 of the Identified Architectures.

[Figure: Kmeans Clustering. Time (in sec) and Efficiency versus Number of Cores (24, 48, 96) for Hadoop MR, Mahout, Python Scripting, Spark, Harp, and MPI. Panels: 1000000 points with 50000 centroids; 10000000 points with 5000 centroids; 100000000 points with 500 centroids.]

Software-Defined Distributed System (SDDS) as a Service includes Dynamic Orchestration and Dataflow

Software (Application or Usage), SaaS: Use HPC-ABDS; Class Usages e.g. run GPU & multicore; Applications; Control Robot
Platform, PaaS: Cloud e.g. MapReduce; HPC e.g. PETSc, SAGA; Computer Science e.g. Compiler tools, Sensor nets, Monitors
Infrastructure, IaaS: Software Defined Computing (virtual Clusters); Hypervisor, Bare Metal; Operating System
Network, NaaS: Software Defined Networks; OpenFlow, GENI
SDDS-aaS Tools: Provisioning; Image Management; IaaS Interoperability; NaaS, IaaS tools; Expt management; Dynamic IaaS NaaS; DevOps

CloudMesh is an SDDSaaS tool that uses Dynamic Provisioning and Image Management to provide custom environments for general target systems. It involves (1) creating, (2) deploying, and (3) provisioning one or more images on a set of machines on demand.
http://mycloudmesh.org/
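The slides name K-means clustering as an example of form (3), Iterative Map Reduce. A minimal single-process sketch of that pattern follows, assuming a plain Python setting: the map step assigns each point to its nearest centroid, the reduce step averages each group, and the loop repeats until the centroids stop moving. The function names, the tolerance, and the sample data are illustrative assumptions, not taken from the slides or from Hadoop, Spark, Harp, or MPI.

```python
from collections import defaultdict
from math import dist  # requires Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reduce step: average all points that share one centroid key."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iterative_mapreduce(points, centroids, max_iters=100, tol=1e-9):
    """Repeat map and reduce until centroids move by at most `tol` (hypothetical default)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        # Map phase: each point is handled independently, so a real
        # runtime could distribute this loop across workers.
        groups = defaultdict(list)
        for p in points:
            key, value = kmeans_map(p, centroids)
            groups[key].append(value)

        # Reduce phase: one new centroid per key; a centroid with no
        # assigned points keeps its previous position.
        new_centroids = [
            kmeans_reduce(groups[i]) if groups[i] else centroids[i]
            for i in range(len(centroids))
        ]

        # Iteration control: stop once no centroid moved more than tol.
        shift = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    sample_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    initial = [(0.0, 0.0), (10.0, 10.0)]
    print(kmeans_iterative_mapreduce(sample_points, initial))
```

In this sketch the centroids are the only state carried between iterations, which is why classic MapReduce (form 2) handles the workload poorly: each pass would reload the points, whereas iterative runtimes keep them in memory across passes.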