BIG DATA APPLICATIONS & ANALYTICS
LOOKING AT INDIVIDUAL HPC-ABDS
SOFTWARE LAYERS
Geoffrey Fox
January 26, 2015
BigDat 2015: International Winter School on Big Data
Tarragona, Spain, January 26-30, 2015
[email protected]
http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
CLOUD COMPUTING SOFTWARE
Using the HPC-ABDS Software Stack
There are a lot of Big Data and HPC software systems
Challenge! Manage an environment offering these different components
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies January 14 2015
Cross-Cutting Functions
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, OpenStack Keystone, LDAP, Sentry, Sqrrl
4) Monitoring: Ambari, Ganglia, Nagios, Inca
21 layers, 289 software packages
17) Workflow-Orchestration: Oozie, ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy,
IPython, Dryad, Naiad, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, NiFi (NSA)
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, mlpy, scikit-learn, CompLearn, Caffe, R, Bioconductor, ImageJ, pbdR,
Scalapack, PetSc, Azure Machine Learning, Google Prediction API, Google Translation API, Torch, Theano, H2O, Google Fusion Tables,
Oracle PGX, GraphLab, GraphX, CINET, NWB, Elasticsearch, IBM System G, IBM Watson, GraphBuilder(Intel), TinkerPop
15A) High level Programming: Kite, Hive, HCatalog, Databee, Tajo, Pig, Phoenix, Shark, MRQL, Impala, Presto, Sawzall, Drill, Google
BigQuery (Dremel), Google Cloud DataFlow, Summingbird, SAP HANA, IBM META, HadoopDB, PolyBase
15B) Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, AWS Elastic Beanstalk, IBM BlueMix, Ninefold,
Aerobatic, Azure, Jelastic, Cloud Foundry, CloudBees, Engine Yard, CloudControl, appfog, dotCloud, Pivotal, OSGi, HUBzero, OODT
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, Stratosphere (Apache Flink), Reef, Hama,
Giraph, Pregel, Pegasus
14B) Streams: Storm, S4, Samza, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Facebook Scribe/ODS, Azure Stream Analytics
13) Inter process communication Collectives, point-to-point, publish-subscribe: Harp, MPI, Netty, ZeroMQ, ActiveMQ, RabbitMQ,
QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Azure Event Hubs, Amazon Lambda
Public Cloud: Amazon SNS, Google Pub Sub, Azure Queues
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis (key value), Hazelcast, Ehcache, Infinispan
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, SciDB, Apache Derby, Google Cloud SQL, Azure SQL,
Amazon RDS, rasdaman, BlinkDB, N1QL, Galera Cluster, Google F1, Amazon Redshift, IBM dashDB
11B) NoSQL: HBase, Accumulo, Cassandra, Solandra, MongoDB, CouchDB, Lucene, Solr, Berkeley DB, Riak, Voldemort, Neo4J,
Yarcdata, Jena, Sesame, AllegroGraph, RYA, Espresso, Sqrrl, Facebook Tao, Google Megastore, Google Spanner, Titan:db
Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Google
Omega, Facebook Corona
8) File systems: HDFS, Swift, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS, Haystack, f4
Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Whirr, JClouds, OCCI, CDMI, Libcloud, TOSCA, Libvirt
6) DevOps: Docker, Puppet, Chef, Ansible, Boto, Cobbler, Xcat, Razor, CloudMesh, Heat, Juju, Foreman, Rocks, Cisco Intelligent
Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic
5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, VMware ESXi,
vSphere, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, VMware vCloud, Amazon, Azure, Google and other public Clouds,
Networking: Google Cloud DNS, Amazon Route 53
USING HPC-ABDS LAYERS I
1) Message Protocols
This layer is unlikely to be directly visible in many applications, as it is used in the "underlying system". Thrift and Protobuf have similar functionality and are used to build messaging protocols between the components (services) of a system.
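For concreteness, a minimal Python sketch of the Avro case (using the fastavro binding; the schema and field names are invented for illustration) might look like:

import io
from fastavro import parse_schema, writer, reader

# Define a message schema once; producers and consumers share it.
schema = parse_schema({
    "name": "SensorReading",
    "type": "record",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "value", "type": "double"},
    ],
})

buf = io.BytesIO()
writer(buf, schema, [{"sensor_id": "s-17", "value": 3.14}])   # serialize
buf.seek(0)
for record in reader(buf):                                    # deserialize
    print(record)

Thrift and Protobuf follow the same pattern: a shared schema plus library or generated serialization code, rather than ad hoc message formats.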
2) Distributed Coordination
Zookeeper is likely to be used in many applications as it is the way that one achieves consistency in distributed systems – especially in the overall control logic and metadata. It is, for example, used in Apache Storm to coordinate distributed streaming data input with multiple servers ingesting data from multiple sensors.
JGroups is less commonly used and is very different. It builds secure
multi-cast messaging with a variety of transport mechanisms.
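A minimal sketch of this coordination pattern, assuming the kazoo Python client and a local Zookeeper server (the znode paths and host are placeholders), is:

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each worker registers an ephemeral znode; it vanishes if the worker dies,
# so all participants always see a consistent membership list.
zk.ensure_path("/app/workers")
zk.create("/app/workers/worker-", b"host=node1", ephemeral=True, sequence=True)

print(zk.get_children("/app/workers"))   # the currently live workers
zk.stop()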
3) Security & Privacy I
Security & Privacy is of course a huge area present implicitly or
explicitly in all applications. It covers authentication and
authorization of users and the security of running systems. In the
Internet there are many authentication systems, with sites often allowing you to use Facebook, Microsoft, Google, etc. credentials. InCommon, operated by Internet2, federates research and higher education institutions in the United States with identity management and related services.
USING HPC-ABDS LAYERS II
3) Security & Privacy II
LDAP is a simple (key-value) database forming a set of distributed directories recording properties of users and resources according to the X.500 standard. It allows secure management of systems. OpenStack Keystone is a role-based authorization and authentication environment used in OpenStack private clouds.
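As a hedged sketch of an LDAP lookup (using the ldap3 Python package; the server, bind DN, search base, and user are invented placeholders):

from ldap3 import Server, Connection, ALL

server = Server("ldap://ldap.example.org", get_info=ALL)
conn = Connection(server, "cn=reader,dc=example,dc=org", "secret", auto_bind=True)

# Query a user entry and selected attributes from the X.500-style tree.
conn.search("dc=example,dc=org", "(uid=jdoe)", attributes=["cn", "mail"])
for entry in conn.entries:
    print(entry)
conn.unbind()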
4) Monitoring:
Here Ambari is aimed at installing and monitoring Hadoop systems. Nagios and Ganglia are similar system monitors with the ability to gather metrics and produce alerts. Inca is a higher-level system allowing user reporting of the performance of any subsystem. Essentially all systems use monitoring, but most users do not add custom reporting.
5) IaaS Management from HPC to hypervisors:
These technologies underlie all applications. The classic technology OpenStack manages virtual machines and associated capabilities such as storage and networking. The commercial clouds have their own solutions, and it is possible to move machine images between these different environments. As a special case there is "bare-metal", i.e. the null hypervisor. The DevOps technology Docker is playing an increasing role as a Linux container.
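As an illustrative sketch (not the only way), the openstacksdk Python client can drive an OpenStack cloud; the cloud name below assumes a matching clouds.yaml entry, and the image/flavor IDs are placeholders:

import openstack

conn = openstack.connect(cloud="mycloud")

# List the virtual machines this project is currently running.
for server in conn.compute.servers():
    print(server.name, server.status)

# Booting a new VM would look roughly like this (IDs are placeholders):
# conn.compute.create_server(name="vm1", image_id="...", flavor_id="...",
#                            networks=[{"uuid": "..."}])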
USING HPC-ABDS LAYERS III
6) DevOps
This describes technologies and approaches that automate the deployment and installation of software systems and underlies "software-defined systems". At IU, we integrate tools together in Cloudmesh – Libcloud, Cobbler, Chef, Docker, Slurm, Ansible, Puppet, Celery. We saw Docker earlier in layer 5 on the last slide.
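A small sketch of scripted deployment with Docker, assuming the docker Python SDK and a local Docker daemon (image and command are placeholders):

import docker

client = docker.from_env()

# Pull an image and run a throwaway container, as a deployment script might
# when provisioning one piece of a software-defined system.
output = client.containers.run("ubuntu:14.04", "echo hello from a container",
                               remove=True)
print(output.decode())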
7) Interoperability
This covers both standards and interoperability libraries for services (Whirr), compute (OCCI), and virtualization and storage (CDMI).
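Apache Libcloud illustrates the interoperability idea: the same code shape works across providers. A hedged sketch (credentials and region are placeholders):

from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

Driver = get_driver(Provider.EC2)            # swap Provider.* to change clouds
driver = Driver("ACCESS_KEY", "SECRET_KEY", region="us-east-1")

for node in driver.list_nodes():             # one uniform API across providers
    print(node.name, node.state)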
8) File systems
One will use files in most applications, but the details may not be visible to the user. Maybe you interact with data at the level of a data management system or an object store (OpenStack Swift or Amazon S3). Most science applications are organized around files; commercial systems are organized at a higher level.
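A minimal sketch of the object-store case (Amazon S3 via the boto3 library; bucket and key names are invented, and Swift has analogous clients):

import boto3

s3 = boto3.client("s3")

# Upload a local file as an object, then read it back.
s3.upload_file("results.csv", "my-science-bucket", "run42/results.csv")
obj = s3.get_object(Bucket="my-science-bucket", Key="run42/results.csv")
print(obj["Body"].read()[:100])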
9) Cluster Resource Management
You will certainly need cluster management in your application, although often this is provided by the system and not explicit to the user. Yarn from Hadoop is gaining in popularity, while Slurm is a basic HPC system, as are Moab, SGE, and OpenPBS; Condor is also well known for scheduling of Grid applications. Mesos is similar to Yarn and is also becoming popular. Many systems are in fact collections of clusters, as in data centers or grids. These require management and scheduling across many clusters; the latter is termed meta-scheduling.
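For the HPC case, a hedged sketch of submitting work to a Slurm-managed cluster from Python by writing a batch script and calling sbatch (the resource values and application name are placeholders):

import subprocess, textwrap

script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=hpcabds-demo
    #SBATCH --nodes=2
    #SBATCH --time=00:10:00
    srun ./my_application
    """)

with open("job.sh", "w") as f:
    f.write(script)

subprocess.run(["sbatch", "job.sh"], check=True)   # hand the job to the scheduler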
USING HPC-ABDS LAYERS IV
10) Data Transport
Globus Online (GridFTP) is the dominant system in the HPC community, but this area is often not highlighted, as the application often only starts after the data has made its way to the disk of the system to be used. Simple HTTP protocols are used for small data transfers, while the largest ones use the "FedEx/UPS" solution of transporting disks between sites.
11) A) File management, B) NoSQL, C) SQL
This is a critical area for nearly all applications, as it captures the areas of file, object, NoSQL, and SQL data management. The many entries in this area testify to the variety of problems (graphs, tables, documents, objects) and the importance of efficient solutions. Just a little while ago, this area was dominated by SQL databases and file managers.
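As a sketch of the document (NoSQL) style, here is MongoDB via the pymongo client; the database, collection, and fields are invented for illustration:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
papers = client.science.papers              # database "science", collection "papers"

# Documents need no fixed schema; query by any field.
papers.insert_one({"title": "HPC-ABDS", "year": 2015, "tags": ["hpc", "abds"]})
for doc in papers.find({"tags": "hpc"}):
    print(doc["title"], doc["year"])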
12) In-memory databases/caches / Object-relational mapping / Extraction Tools
This is another important area addressing two points: firstly, conversion of data between formats, and secondly, enabling caching to put as much processing as possible in memory. This is an important optimization, with Gartner highlighting this area in several recent hype charts with In-Memory DBMS and In-Memory Analytics.
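A sketch of the caching point, assuming a local Redis server and the redis Python client (run_backend_query is a hypothetical stand-in for the slow path):

import redis

cache = redis.Redis(host="localhost", port=6379)

def run_backend_query(key):
    # placeholder for the slow path, e.g. a SQL query or a file scan
    return b"computed-value-for-" + key.encode()

def cached_query(key):
    value = cache.get(key)
    if value is not None:
        return value                        # served entirely from memory
    value = run_backend_query(key)
    cache.set(key, value, ex=300)           # keep it hot for five minutes
    return value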
USING HPC-ABDS LAYERS V
13) Inter-process communication: Collectives, point-to-point, publish-subscribe, MPI
This describes the different communication models used by the systems in layers 13 and 14. Results may be very sensitive to choices here, as there are big differences between disk-based and point-to-point (no disk) communication, as in Hadoop versus Harp (MPI), or in the different latencies exhibited by publish-subscribe systems. I always recommend pub-sub systems like ActiveMQ or RabbitMQ for messaging.
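A sketch of the publish-subscribe style with RabbitMQ via the pika client (pika 1.x style; broker address, queue name, and message are placeholders):

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="sensor-readings")

# Producer side: publish a message onto the queue.
ch.basic_publish(exchange="", routing_key="sensor-readings", body=b"42.7")

# Consumer side: pull one message back off the queue.
method, properties, body = ch.basic_get(queue="sensor-readings", auto_ack=True)
print(body)
conn.close()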
14) A) Basic Programming model and runtime, SPMD, MapReduce, MPI
B) Streaming
A very important layer defining the cloud (HPC-ABDS) programming model. It includes Hadoop and the related tools Spark, Twister, Stratosphere, and Hama (iterative MapReduce); Giraph, Pregel, and Pegasus (graphs); Storm, S4, and Samza (streaming); and Tez (workflow) and Yarn integration. Most applications use something here!
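As an illustration of this layer, the classic MapReduce word count in PySpark (the input path is a placeholder and could equally be a local file):

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
counts = (sc.textFile("hdfs:///data/corpus.txt")
            .flatMap(lambda line: line.split())      # map: emit words
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))        # reduce: sum per word
print(counts.take(10))
sc.stop()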
15) A) High level Programming
Components at this level are not required but are very interesting, and we can expect great progress both in improving them and in using them. Pig and Sawzall offer data-parallel programming models; Hive, HCatalog, Shark, MRQL, Impala, and Drill support SQL interfaces to MapReduce, HDFS, and object stores.
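A hedged sketch of the SQL-on-Hadoop idea, assuming the PyHive client for Hive (host, table, and columns are placeholders; Impala, Presto, and Drill expose very similar DB-API interfaces):

from pyhive import hive

conn = hive.Connection(host="hive.example.org", port=10000)
cur = conn.cursor()

# A declarative query that Hive compiles into MapReduce/Tez jobs.
cur.execute("SELECT year, COUNT(*) FROM papers GROUP BY year")
for row in cur.fetchall():
    print(row)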
USING HPC-ABDS LAYERS VI
15) B) Frameworks
This is exemplified by Google App Engine and Azure (when it was positioned as a PaaS), but now there are many "integrated environments".
16) Application and Analytics
This is the "business logic" of the application and where you find machine learning algorithms like clustering. Mahout, MLlib, and MLbase are in Apache for Hadoop and Spark processing; R is a central library from the statistics community. There are many other important libraries, of which we mention those in deep learning (CompLearn, Caffe), image processing (ImageJ), bioinformatics (Bioconductor), and HPC (Scalapack and PetSc). You will nearly always need these or other software at this level.
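As a sketch of a typical layer-16 step, k-means clustering with scikit-learn on synthetic data (the feature matrix and cluster count are arbitrary examples):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 3)            # stand-in for real feature vectors
model = KMeans(n_clusters=3).fit(X)

print(model.cluster_centers_)          # the "business logic" output
print(model.labels_[:10])              # cluster assigned to each point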
17) Workflow-Orchestration
This layer implements the orchestration and integration of the different parts of a job. These can be specified by a directed data-flow graph and often take the simple pipeline form illustrated in "access pattern" 10, discussed later. This field was advanced significantly by the Grid community, and the systems are quite similar in functionality, although their maturity and ease of use can be quite different. The interface is either visual (link programs as bubbles with data flow) or an XML or program (Python) script.
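As a sketch of the script form, the simple pipeline below is a plain Python program whose directed data-flow graph degenerates to a linear chain; the three stage functions are hypothetical placeholders:

def fetch(url):
    # stage 1: acquire raw data (placeholder)
    return "raw data from " + url

def analyze(data):
    # stage 2: an analytics step from layer 16 (placeholder)
    return len(data)

def publish(result):
    # stage 3: store or display the output (placeholder)
    print("result:", result)

publish(analyze(fetch("https://data.example.org/input")))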