David DeWitt
• UW-Madison CS professor (1976-2008)
• Microsoft Technical Fellow (2008 - ??)
• Manages the Jim Gray Systems Lab (graysystemslab.com)

Agenda: what? why? how? what's next

what?
• Customers need it for their businesses to 'survive' in the 21st century
• It relieves customers' pain
• If we do it right, it will make customers feel 'great' too
"PolyBase is an 'aspirin', 'air' and 'jewel' in data management for Big Data" - Anonymous @Microsoft

PolyBase provides a scalable, T-SQL compatible query processing framework for combining data from both universes: relational data in SQL Server 16 and "semi-structured" data in Hadoop or Windows Azure Blob Storage (WASB). It allows SQL 16 customers to execute T-SQL queries against relational data in SQL Server and "semi-structured" data in HDFS and/or Azure.

why?
• Increased number and variety of data sources that generate large quantities of data:
  - Sensors (e.g., location, speed, acceleration rates, acoustical, ...)
  - Web 2.0 (e.g., Twitter, wikis, ...)
  - Web clicks
• Realization that data is "too valuable" to delete
• Dramatic decline in the cost of hardware, especially storage
  - If storage were still $100/GB, there would be no big data revolution underway

The Hadoop ecosystem:
• Distributed storage: HDFS
• Distributed processing: MapReduce
• Query: Hive
• Machine learning: Mahout
• Scripting: Pig
• NoSQL database: HBase
• Data integration: Sqoop/REST/ODBC
• Workflow & scheduling: Oozie
• Coordination: ZooKeeper
• Management & monitoring: Ambari

HDFS is the underpinning of the entire Hadoop ecosystem: highly fault-tolerant, but append-only (no updates!). Parallel SQL systems for scalable distributed data warehousing (Hive, Impala, Spark/Shark, HAWQ) all sit on the HDFS file system.

Two world views:
• Hadoop community world view: "Big Data goes HERE" (HDFS), with the rest of the world somewhere else
• Relational world view: structured data in relational tables
Can the two join each other? You sure can!

The PolyBase insight: there are two universes of data - structured relational data, and semi-structured data in HDFS for "big data". PolyBase's goal is to provide a scalable, T-SQL compatible query processing framework for combining data from both universes, e.g.:
• cleansing data before loading it
• joining relational tables with streams of tweets
• mining sensor data for predictive analytics

Example #1: Driver-behavior pricing (a T-SQL sketch appears at the end of this section)
• Non-relational: sensor data from cars (kept in Hadoop)
• Structured: customer data and price policies based on driver behavior (kept in SQL Server PDW/APS)

Example #2: Basket analysis
• Non-relational: social media (kept in Hadoop)
• Structured: product data (kept in SQL Server PDW/APS)
• Basket analysis of online shoppers, based on social media behavior

Example #3: Drilling rig analysis
• Old sensor data on the monitoring & functioning of the rig (kept in Hadoop)
• More recent ('hot') data (kept in SQL Server PDW/APS)

PolyBase is now part of SQL 16 (CTP2). Timeline: 2012, 2013, 2014 (past) ... 2015 (today) ... 2016 and beyond (future).

how? (PolyBase in SQL Server PDW)
PolyBase = SQL Server PDW V2 querying HDFS/Azure data, in-situ:
• Leverages PDW's parallel query execution framework
• Exploits PDW's parallel query optimizer to selectively push computations on HDFS data as MapReduce jobs
• Data moves in parallel directly between Hadoop's data nodes and PDW's compute nodes
• Standard T-SQL query language - eliminates the need for writing MapReduce jobs
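To make that last bullet concrete - standard T-SQL instead of hand-written MapReduce jobs - here is a minimal sketch in the spirit of Example #1. The table and column names are hypothetical (Customers as an ordinary SQL Server table, CarSensorData as an external table over HDFS), not the talk's actual schema:

    -- Hybrid query: relational customer data joined with car sensor
    -- readings that live in Hadoop, surfaced via an external table.
    SELECT   c.Name,
             AVG(s.Speed)          AS AvgSpeed,
             MAX(s.HardBrakeCount) AS HardBrakes
    FROM     Customers     AS c   -- relational table in SQL Server
    JOIN     CarSensorData AS s   -- external table over HDFS
             ON s.CustomerKey = c.CustomerKey
    GROUP BY c.Name;

The same statement works unchanged whether CarSensorData maps to files in HDFS or in Azure blob storage; only the external data source definition differs.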
PDW architecture refresher:
• Control node: parses SQL, validates and authorizes, optimizes and builds the query plan, executes the parallel query, and returns results to the client
• Client connections: JDBC, OLEDB, ODBC, ADO.NET
• Multiple compute nodes, each running a SQL Server instance
• Data Movement Service (DMS): a separate process on each node that shuffles intermediate tables among the compute nodes during query execution

PDW is a scalable SQL Server DW offering with highly competitive performance and cost. It is currently available only in appliance form factor, but will soon be available as the SQL DW Service in Azure.

KEY COMPONENTS
• One control node (Engine Service + DMS): compiles SQL queries and controls their execution in parallel
• Multiple compute nodes (each a SQL Server instance + DMS): execute SQL queries in parallel

The Hadoop cluster can be either a Hortonworks or a Cloudera supported distribution, on premise or in the cloud, on a Windows or Linux cluster. Many popular file formats are supported (text, RC, ORC, ...), with Parquet planned for the future.

Key Technical Challenges
1. Supporting arbitrary file formats (e.g., Text, RC, ORC, Parquet, ...)
2. Parallelizing data transfers between compute nodes and HDFS data nodes
3. Imposing structure on unstructured data in HDFS, using the external table concept
4. Exploiting the computational resources of Hadoop clusters

Challenges #1 and #2: the HDFS Bridge. DMS on each compute node is augmented with an HDFS Bridge that hides the complexity of HDFS. It uses standard Hadoop "RecordReaders/Writers", so standard HDFS file types can be read and written, and it is used to transfer data in parallel to and from the Hadoop data nodes.

Challenge #3: imposing structure via external tables.

    CREATE EXTERNAL DATA SOURCE GSL_HDFS_CLUSTER
    WITH (TYPE = HADOOP,
          LOCATION = 'hdfs://10.xxx.xx.xx:8020',
          JOB_TRACKER_LOCATION = '10.xxx.xx.xx:5020');

    CREATE EXTERNAL FILE FORMAT TEXT_FORMAT
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
          FORMAT_OPTIONS (FIELD_TERMINATOR = '\t'));

    CREATE EXTERNAL TABLE CUSTOMER (
        c_custkey   bigint not null,
        c_name      varchar(25) not null,
        c_address   varchar(40) not null,
        c_nationkey integer not null,
        ...
    )
    WITH (LOCATION = '/tpch1gb/customer.tbl',   -- HDFS file path
          DATA_SOURCE = GSL_HDFS_CLUSTER,
          FILE_FORMAT = TEXT_FORMAT);

Selection on an external table in HDFS (Customer). A possible execution plan:
1) CREATE temp table T (on the compute nodes)
2) IMPORT FROM HDFS: the Customer file is read in parallel into T
3) EXECUTE QUERY: Select * from T where T.c_nationkey = 3 and T.c_acctbal < 0
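What the user actually writes for the plan above is ordinary T-SQL against the external table; PolyBase creates T and performs the parallel HDFS import behind the scenes. A sketch, using the predicate values from the plan:

    -- Selection over the external CUSTOMER table; PolyBase imports the
    -- HDFS file into a temp table on the compute nodes and filters there.
    SELECT *
    FROM   CUSTOMER
    WHERE  c_nationkey = 3
      AND  c_acctbal   < 0;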
Challenge #4: exploiting the computational resources of the Hadoop cluster.
• The query is parsed, and "external tables" stored on HDFS are identified
• Parallel query optimization is performed in the standard fashion, using statistics on the HDFS tables
• The query plan generator walks the optimized query plan, converting subtrees whose inputs are all HDFS files into a sequence of MapReduce jobs
• The Engine Service submits the MapReduce jobs (as a JAR file) to the Hadoop cluster, leveraging its computational capabilities

Pipeline: SQL query → Parser → logical operator tree → Query Optimizer → physical operator tree → Query Plan Generator → Engine Service → Hadoop MapReduce / HDFS

SQL operations on HDFS data are pushed into Hadoop as MapReduce jobs, and the optimizer makes a cost-based decision on how much computation to push.

Example: selection and aggregate (avg, group by) on the external Customer table (see the T-SQL sketch at the end of this section). What really happens here?
Step 1) The query optimizer compiles the predicate into Java and generates a MapReduce (MR) job.
Step 2) The query engine submits the MR job to the Hadoop cluster. The job applies the filter and computes the aggregate on Customer, leaving its output in hdfsTemp: <US, $-975.21> <UK, $-63.52> <FRA, $-119.13>
Step 3) CREATE temp table T on the compute nodes.
Step 4) IMPORT: read hdfsTemp into T.
Step 5) RETURN OPERATION: Select * from T.

Two points to note:
1. The predicate and aggregate are pushed into the Hadoop cluster as a MapReduce job.
2. The query optimizer makes a cost-based decision on which operators to push.

PolyBase (in SQL Server PDW) wrap-up:
• Highest possible performance: parallelized data transfers between the PDW appliance and Hadoop clusters; push down of SQL operations to Hadoop
• Simplicity: query data in Hadoop and/or data in PDW via standard T-SQL
• Open: supports the most popular Hadoop distributions, for both Linux and Windows
• Full integration with Microsoft Office & BI: Excel's PowerPivot and PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services

Recap of the key components:
• HDFS Bridge (in DMS): used to read/write files and directories in HDFS or Azure
• Engine Service: parses, optimizes, and orchestrates the parallel execution of queries over relational tables and HDFS data
• DMS: moves data between compute nodes and Hadoop data nodes
• External table construct: "surfaces" records in HDFS or Azure files to SQL
• MapReduce job pushdown: used by the Engine Service to push computation to the Hadoop cluster
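For concreteness, the query behind the pushdown example above (per-nation average over negative account balances) could be written as follows; column names are carried over from the CUSTOMER table defined earlier, and the per-nation averages such as <US, $-975.21> are the hdfsTemp output the MR job leaves behind:

    -- Filter plus per-nation average. The optimizer may compile the
    -- predicate and the aggregate into a MapReduce job and run it inside
    -- the Hadoop cluster, so only the small aggregated result is imported.
    SELECT   c_nationkey, AVG(c_acctbal) AS avg_acctbal
    FROM     CUSTOMER
    WHERE    c_acctbal < 0
    GROUP BY c_nationkey;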
how? (PolyBase in SQL Server 16)

What does it mean?
• Clusters of SQL Server 16 instances can be used to process data residing in HDFS in parallel
• Queries can arbitrarily mix relational data in SQL Server with data in HDFS or Azure
• Full T-SQL interface for HDFS-resident data (RTM)
• All standard BI tools supported

Step 1: Set up a Hadoop cluster (if you don't have one already):
• Hortonworks or Cloudera distributions
• Hadoop 2.0 or above
• Linux or Windows
• On premise or in Azure
(A configuration sketch for the SQL Server side appears just before the summary below.)

Step 2: Set up the SQL 16 cluster. Each node runs the PolyBase DLLs plus DMS:
• The head node is the SQL Server instance to which queries are submitted
• Compute nodes are used for scale-out query processing of data in HDFS or Azure
• The head node is actually also always a compute node

External data sources connect the cluster to the outside world, e.g.:

    CREATE EXTERNAL DATA SOURCE GSL_HDFS_CLUSTER WITH (TYPE = HADOOP, ...)  -- Hadoop cluster
    CREATE EXTERNAL DATA SOURCE AV_DFS_CLUSTER WITH (TYPE = HADOOP, ...)    -- Hadoop cluster #2 or Azure storage volumes

Usage rules:
• T-SQL queries are submitted to the head node
• Queries can only refer to tables on the head node and/or external tables
• Compute nodes are used for scale-out query processing on external tables in HDFS/Azure
• Tables on compute nodes cannot be referenced by queries submitted to the head node
• The number of compute nodes can be dynamically adjusted by the DBA
• Hadoop clusters can be shared between different SQL 16 clusters

Key differences between the offerings:

                        Scale-out Storage        Scale-out QP                      Language Surface  Delivery Vehicle
    APS (PDW)           Relational, HDFS, Azure  Relational, HDFS, Azure           T-SQL subset      Appliance
    SQL DW Service      Relational & Azure       Relational & Azure                T-SQL subset      Cloud (PaaS)
    SQL 16 + PolyBase   HDFS & Azure             Relational, HDFS (CTP2); HDFS, Azure (RTM)  Full T-SQL  "Box" & Cloud (IaaS)

Current limitations (CTP2):
1. Each SQL 16 instance can participate in only a single PolyBase cluster - a limitation of the current DMS software
2. Multiple compute nodes are not yet "production ready"
3. No support for varchar(max) or unique column types
4. No scale-out for joins between tables on the head node and external tables on HDFS - this should be fixed by CTP3 (for sure by RTM); see the example below

Example:

    Select C.Name
    from Orders O, Customers C
    where C.id = O.CustId and O.price > $1000

Probable CTP2 query plan:
1) Use an MR job to find Orders > $1000
2) Import the result into the SQL 16 instance on the head node
3) Perform the join on the head node

The CTP3 plan is likely to be:
1) Use an MR job to find Orders > $1000
2) Import the result into the SQL 16 instances on the compute nodes
3) Redistribute the Customers table from the head node to the compute nodes
4) Perform the join in parallel on the compute nodes

Frequently asked questions:
1. Will AlwaysOn (Hadron) be supported?
2. Can a compute node be used for other SQL workloads?
3. Can a compute node share a machine with a Hadoop data node?
4. What SQL Server editions will I need?
5. Will the MapR Hadoop distribution be supported?
6. Is there a limit on the number of compute nodes?
7. Why is it not possible to support different Hadoop distros simultaneously?

what's next
1. "Local" table scale-out query processing
2. "Official" support for multiple compute nodes
3. Enhanced performance mechanisms:
   • ES and DMS running inside SQL Server instead of as separate processes
   • Streaming of data from HDFS directly into the leaves of executing SQL query plans
   • Native (i.e., C++) HDFS libraries for the Text, ORC, and Parquet file formats
   • Indexing of HDFS data
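As promised under Step 1 above, here is a configuration sketch for the SQL Server side. It is a minimal sketch, assuming the SQL Server 2016 'hadoop connectivity' sp_configure option and hypothetical Azure storage names (mystorageacct, mydata); check the documentation for the connectivity value that matches your distribution:

    -- Point PolyBase at a particular Hadoop distribution (the value-to-
    -- distribution mapping is in the SQL Server docs; a service restart
    -- is required for the change to take effect).
    EXEC sp_configure 'hadoop connectivity', 7;  -- e.g., Hortonworks on Linux (assumed)
    RECONFIGURE;

    -- An external data source over Azure blob storage (hypothetical
    -- account/container names; non-public storage additionally needs
    -- a database-scoped credential).
    CREATE EXTERNAL DATA SOURCE AZURE_STORAGE
    WITH (TYPE = HADOOP,
          LOCATION = 'wasbs://mydata@mystorageacct.blob.core.windows.net');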
Summary
1. Simplicity - query data in Hadoop, Azure, and SQL Server via standard T-SQL
2. Highest possible performance - parallelized data transfers between SQL 16 instances and Hadoop data nodes; push down of SQL operations to Hadoop using MR jobs
3. Open - supports the most popular Hadoop distributions, for both Linux and Windows
4. Full integration with Microsoft Office & BI - Excel's PowerPivot and PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services