Big Data = Transactions + Interactions + Observations

Data volume grows with increasing data variety and complexity:
• Megabytes: ERP
• Gigabytes: CRM
• Terabytes: WEB
• Petabytes: BIG DATA

Data types shown across these tiers: purchase detail, purchase record, payment record, offer details, offer history, customer touches, support contacts, segmentation, web logs, A/B testing, dynamic pricing, dynamic funnels, affiliate networks, search marketing, behavioral targeting, business data feeds, external demographics, user generated content, user click stream, mobile web, SMS/MMS, sentiment, speech to text, social interactions & feeds, spatial & GPS coordinates, sensors / RFID / devices, product/service logs, and HD video, audio, images.

Data growth (Source: IDC)
• 2.8 ZB in 2012
• 85% from New Data Types
• 15x Machine Data by 2020
• 40 ZB by 2020

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM, REPOSITORIES: RDBMS, EDW, MPP
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); OLTP, ERP, CRM Systems; Unstructured documents, emails; Server logs; Clickstream; Sentiment, Web Data; Sensor, Machine Data; Geolocation

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DEV & DATA TOOLS: Build & Test
OPERATIONS TOOLS: Provision, Manage & Monitor
DATA SYSTEM: Data Management, Data Access, Governance & Integration, Security, Operations
REPOSITORIES: RDBMS, EDW, MPP
SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data; Clickstream

Hortonworks Data Platform 2.2
GOVERNANCE (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, WebHDFS
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS:
• Script: Pig (on Tez)
• SQL: Hive (on Tez)
• Java, Scala: Cascading
• NoSQL: HBase (on Slider)
• Stream: Storm (on Slider)
• In-Memory: Spark
• Search: Solr
• Others: ISV Engines
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System), nodes 1 to N
SECURITY: Authentication, Authorization, Accounting, Data Protection
• Storage: HDFS
• Resources: YARN
• Access: Hive, …
• Pipeline: Falcon
• Cluster: Knox
• Cluster: Ranger
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
Deployment Choice: Linux, Windows, On-Premises, Cloud

Hortonworks Data Platform (HDP)
The Only Completely Open Distribution for Apache Hadoop
• Fundamentally Versatile
• Comprehensive enterprise capabilities
• Wholly Integrated for deep ecosystem interoperability

HDP certifies the most recent & stable community innovation
Releases: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014)
• Data Management: Hadoop & YARN
• Data Access: Pig, Hive & HCatalog, HBase, Phoenix, Storm, Spark, Solr, Tez, Slider
• Governance & Integration: Falcon, Sqoop, Flume
• Security: Knox, Ranger
• Operations: Ambari, Oozie, Zookeeper
* Version numbers are targets and subject to change at time of general availability in accordance with the ASF release process.

Microsoft components in this architecture: HDInsight, Azure, Power BI (New!)
Layers: SOURCES, DATA SYSTEM, APPLICATIONS, DEV & DATA TOOLS, OPERATIONAL TOOLS, INFRASTRUCTURE

Open Source Data Management
• Scalable: Linearly scale to store petabytes of data
• Reliable: Redundant storage protects against node failures
• Flexible: Store all types of data, apply flexible schemas for analysis and sharing
• Economical: Utilize cost-efficient commodity hardware, achieve high cluster utilization

HDFS
• Distributed across "nodes"
• Natively redundant
• Single File System

YARN
• Cluster Resource Manager
• Built in Fault Tolerance
• High Cluster Utilization

YARN diagram: a ResourceManager with its Scheduler assigns work to a grid of NodeManagers that run three kinds of workload side by side:
• Batch: map1.1, map1.2, reduce1.1
• Interactive SQL: vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2
• Real-Time: nimbus0, nimbus1, nimbus2

Traditional Database versus Hadoop Platform
SCALE (storage & processing) axis: Traditional Database, EDW, MPP Analytics, NoSQL, Hadoop Platform
• schema: Traditional Database: required on write. Hadoop: required on read.
• speed: Traditional Database: reads are fast. Hadoop: writes are fast.
• governance: Traditional Database: standards and structured. Hadoop: loosely structured.
• processing: Traditional Database: limited, no data processing. Hadoop: processing coupled with data.
• data types: Traditional Database: structured. Hadoop: multi and unstructured.
• best fit use: Traditional Database: Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store. Hadoop: Data Discovery, Processing unstructured data, Massive Storage/Processing.
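The schema row of the comparison above (required on write for a traditional database, required on read for Hadoop) can be illustrated with a minimal Python sketch. It is not taken from the slides: the `SCHEMA` columns and both function names are hypothetical, and a plain list and a string stand in for a database table and a raw file.

```python
import csv
import io

# Hypothetical schema for illustration: column name -> type constructor.
SCHEMA = {"user_id": int, "state": str, "price": float}


def write_with_schema(table, record):
    """Schema on write: validate and convert before the record is stored."""
    # A bad value raises ValueError here, so it never reaches storage.
    row = {name: cast(record[name]) for name, cast in SCHEMA.items()}
    table.append(row)


def read_with_schema(raw_text):
    """Schema on read: storage holds raw text; structure is applied per query."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        if len(fields) != len(SCHEMA):
            continue  # wrong column count: skipped at read time
        try:
            rows.append({name: cast(value)
                         for (name, cast), value in zip(SCHEMA.items(), fields)})
        except ValueError:
            continue  # unparseable value: skipped at read time
    return rows


# Schema on write: the row is converted once, when it is stored.
table = []
write_with_schema(table, {"user_id": "1", "state": "CA", "price": "9.5"})

# Schema on read: everything was accepted as raw text; the malformed
# second line is only discovered when the data is queried.
raw = "1,CA,9.5\n2,NY,not-a-price\n3,TX,4.0\n"
rows = read_with_schema(raw)
```

The trade-off matches the speed row of the same comparison: the write path of `read_with_schema` does no work at all (writes are fast), while the write path of `write_with_schema` pays the validation cost up front so that later reads can trust the stored rows.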
Hortonworks and Microsoft
• Hortonworks Data Platform (HDP) for Windows
• Microsoft Azure HDInsight
• Microsoft Analytics Platform System (APS)
All offerings are co-engineered by Hortonworks and Microsoft, with seamless interoperability across on-premises and cloud.

HORTONWORKS DATA PLATFORM (HDP) For Windows
DATA ACCESS:
• Batch: MapReduce
• Script: Pig
• SQL: Hive/Tez, HCatalog
• NoSQL: HBase, Accumulo
• Stream: Storm
• Search: Solr
• Others: In-Memory Analytics, ISV engines
YARN: Data Operating System
DATA MANAGEMENT: HDFS (Hadoop Distributed File System), nodes 1 to N

Getting data into HDFS: Sqoop, Flume
• RPC
• REST (HTTP)
• C LibHDFS

Stinger Initiative
Stack: Custom Apps and Business Analytics issue SQL to Apache Hive, which runs on Apache Tez or Apache MapReduce, on Apache YARN, over HDFS (Hadoop Distributed File System).
Apache Hive Contribution… an Open Community at its finest
• 1,672 Jira Tickets Closed
• 145 Developers
• 44 Companies
• ~390,000 Lines Of Code Added… (2x)
• 13 Months

Apache Tez
• Replaces MapReduce as the primitive for Hive, Pig, etc.
• A task has a pluggable Input, Processor and Output: Tez Task - <Input, Processor, Output>
• Tez avoids unneeded writes to HDFS

Example query:
    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state

Diagram (Hive on MR versus Hive on Tez): the query's stages (SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN(a, c); JOIN(a, b); GROUP BY a.state with COUNT(*) and AVG(c.price)) run as map (M) and reduce (R) tasks. Under Hive on MR the stages are chained through intermediate writes to HDFS; under Hive on Tez they run as one graph of M and R tasks without those writes.

SQL Compliance
Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.
Hive SQL Datatypes:
• INT
• TINYINT/SMALLINT/BIGINT
• BOOLEAN
• FLOAT
• DOUBLE
• STRING
• TIMESTAMP
• BINARY
• DECIMAL
• ARRAY, MAP, STRUCT, UNION
• DATE
• VARCHAR
• CHAR
Hive SQL Semantics:
• SELECT, INSERT
• GROUP BY, ORDER BY, SORT BY
• JOIN on explicit join key
• Inner, outer, cross and semi joins
• Sub-queries in FROM clause
• ROLLUP and CUBE
• UNION
• Windowing Functions (OVER, RANK, etc.)
• Custom Java UDFs
• Standard Aggregation (SUM, AVG, etc.)
• Advanced UDFs (ngram, Xpath, URL)
• Sub-queries for IN/NOT IN, HAVING
• Expanded JOIN Syntax
• INTERSECT / EXCEPT
Legend: Hive 0.11; Hive 0.12 (HDP 2.0); Hive 0.13 (HDP 2.1)

Apache Storm
Diagram: Storm alongside NoSQL (HBase) on YARN: Data Operating System, over HDFS (Permanent Data Storage), nodes 1 to N.

Apache Solr
Diagram: a MapReduce Indexing Job over HDFS (Hadoop Distributed File System).

HDP data access engines on YARN
• Script: Pig (on Tez)
• SQL: Hive (on Tez)
• Java, Scala: Cascading (on Tez)
• NoSQL: HBase, Accumulo (on Slider)
• Stream: Storm (on Slider)
• In-Memory: Spark
• Search: Solr
• Others: Kafka, ISV Engines
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System), nodes 1 to N

1st Gen of Hadoop: Single Use System (Batch Apps)
• Classic Hadoop Apps
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)

2nd Gen of Hadoop: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
• Batch: MapReduce
• Batch & Interactive: Tez
• Flexible Data Processing: Hive, Pig, others…
• Online Data Processing: HBase, Accumulo
• Stream Processing: Storm
• others …
• Efficient Cluster Resource Management & Shared Services (YARN)
• Redundant, Reliable Storage (HDFS)

Apache Falcon
Provides a key governance framework to:
• Define sophisticated workflows and DLM policies
• Enable audit, compliance, and data re-processing
Example retention policies by data stage:
• Staged Data: Retain 5 Years
• Cleansed Data: Retain 3 Years
• Conformed Data: Retain 3 Years
• Presented Data: Retain Last Copy Only
Replication:
• Disaster Recovery and Backup between environments
• Publishing data between environments for Discovery
• Site to Site
• Site to Cloud

HDP Advanced Security

Apache Knox
Knox Gateway (GW): a stateless reverse proxy instance deployed in the DMZ.
• Requests are streamed through the GW to Hadoop services after authentication.
• URLs are rewritten to refer to the gateway.
Diagram: Browser, REST Client and JDBC Client connect through a Firewall to the Knox Gateway in the DMZ, which authenticates against Identity Providers (Enterprise Identity Provider, LDAP/AD). Behind a second Firewall sit HDP Cluster 1 and HDP Hadoop Cluster 2, each with Masters (NN, JT) and workers (DN, TT) and the services WebHCat, Oozie, Hive, HBase and YARN.

Ambari: Deploy, Manage, Monitor
• AMBARI WEB and REST APIs in front of the AMBARI SERVER
• PROVISION | MANAGE | MONITOR compute & storage
• Ambari SCOM Server aggregates and exposes Hadoop metrics
• Ambari SCOM Mgmt Pack
• Ambari SCOM monitors health and alerts in case of problems

HADOOP: Storage & Process at Scale

Resources
• microsoft.com/sqlserver and Amazon Kindle Store
• microsoftvirtualacademy.com
• Azure Machine Learning, DocumentDB, and Stream Analytics
• http://channel9.msdn.com/Events/TechEd
• www.microsoft.com/learning
• http://microsoft.com/technet
• http://developer.microsoft.com
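
Returning to the Knox gateway described earlier ("URLs rewritten to refer to gateway"), the idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Knox's implementation: the gateway address, the cluster name `cluster1`, the internal host names and the function name are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed values for illustration only; none of these come from the slides.
GATEWAY = "https://knox.example.com:8443/gateway"
CLUSTER = "cluster1"
# Hypothetical internal service addresses that must not leak to clients.
INTERNAL_HOSTS = {"namenode.internal:50070", "webhcat.internal:50111"}


def rewrite_to_gateway(internal_url):
    """Rewrite an internal service URL so that it points at the gateway."""
    parts = urlsplit(internal_url)
    if parts.netloc not in INTERNAL_HOSTS:
        raise ValueError(f"no gateway mapping for {parts.netloc}")
    gw = urlsplit(GATEWAY)
    # Keep the service path and query; swap scheme and host for the
    # gateway's, and prefix the gateway path and cluster name.
    new_path = f"{gw.path}/{CLUSTER}{parts.path}"
    return urlunsplit((gw.scheme, gw.netloc, new_path, parts.query, parts.fragment))


internal = "http://namenode.internal:50070/webhdfs/v1/tmp?op=LISTSTATUS"
external = rewrite_to_gateway(internal)
```

With this kind of rewriting, clients outside the firewall only ever see the gateway's address, which is what lets the cluster hosts stay hidden behind the DMZ.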