Transcript Document
Copyright (c) 2014 Steve O'Hearn http://www.databasetraining.com

Objectives:
1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop
2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop's HDFS
3. Understand the purpose, benefits, and limitations of the various techniques and tools

What is the Oracle RDBMS? What is Hadoop?

(Source: Oracle Database Concepts, 12c Release 1 (12.1))

What is Hadoop?
- A framework of tools: open source, written in Java
- A collection of Apache Software Foundation projects
- Several tools: HDFS (storage) and MapReduce (analysis), plus HBase, Hive, Pig, Sqoop, Flume, and more
- Components communicate over network sockets

HDFS (Hadoop Distributed File System):
- Stores text files and binary files
- Very large data blocks: 64MB minimum, 1GB or higher possible; typical is 128MB or 256MB
- Replication: 3-copy default
- A namenode and datanodes

MapReduce, the analytical engine of Hadoop:
- JobTracker and TaskTracker
- Interprets data at runtime instead of using predefined schemas

The Hadoop stack: Hive, Pig, HBase, and other tools run on MapReduce, which uses the HDFS API over HDFS itself (text files and binary files).

NOTE: There are file systems
other than HDFS.

Oracle RDBMS vs. Hadoop:

           Oracle RDBMS               Hadoop
Integrity  High                       Low
Schema     Structured                 Unstructured
Use        Frequent reads and writes  Write once, read many
Style      Interactive and batch      Batch

Sample scenario:
- Move data from Oracle into Hadoop
- Perform MapReduce on datasets that include Oracle data
- Move MapReduce results back into Oracle for analysis, reporting, etc.

Other uses:
- Oracle queries that join with Hadoop datasets
- Scheduled batch MapReduce results to be warehoused

Oracle and Hadoop together form a comprehensive platform for managing all forms of data, both structured and unstructured. Hadoop provides "big data" processing; Oracle provides analytic capabilities not found in Hadoop. (NOTE: This is changing.)
Techniques and tools overview:

Oracle to Hadoop:
- PUSH from Oracle: SQL's SELECT via SQL, PL/SQL, or Java program units (such as stored procedures, etc.)
- PULL from Hadoop: custom Java with JDBC, DBInputFormat, the FileSystem API, and Avro
- Sqoop

Hadoop to Oracle:
- PUSH from Hadoop: custom Java with JDBC, DBOutputFormat, the FileSystem API, and Avro; Sqoop
- PULL from Oracle: Oracle Loader for Hadoop; Oracle SQL Connector for HDFS

PUSH with SQL's SELECT statement. Use Java to:
- Spool and control output
- Use string concatenation to create delimited text HDFS files
- Use the Java Avro API to create serialized binary HDFS files

PUSH / PULL with custom Java:
- JDBC to connect to and interact with the RDBMS
- DBInputFormat: reading from a database; DBOutputFormat: dumping to a database
- Generates SQL; best for smaller amounts of data
- Package: org.apache.hadoop.mapreduce.lib.db
- To interact with HDFS files: the FileSystem API, and the Avro API for binary files

Sqoop (PUSH / PULL):
- Sqoop = "SQL to Hadoop"
- Command-line tool
- Works with any JDBC-compliant RDBMS
- Works with any external system that supports bulk data transfer into Hadoop (HDFS, HBase, Hive)
- Strength: transfer of bulk data between Hadoop and RDBMS environments
- Read / write / update / insert / delete
- Stored procedures (warning: parallel processing)

Sqoop background (PUSH / PULL):
- Open source, written in Java
- Apache Top-Level Project (graduated from incubator level March 2012)
- Bundled with the Oracle Big Data Appliance and with CDH (Cloudera Distribution including Apache Hadoop)
- Also available from the Apache Software Foundation: http://sqoop.apache.org
- Latest version of Sqoop2: 1.99.3 (as
of 4/15/14)
- Wiki: https://cwiki.apache.org/confluence/display/SQOOP

Sqoop file formats (PUSH / PULL):
- Text: human-readable
- Binary: precision and compression; examples are SequenceFile (Java-specific) and Avro
- Note: Sqoop cannot currently load SequenceFile or Avro files into Hive.

Sqoop interactions (PUSH / PULL):
- Interacts with structured data stores outside of HDFS
- Moves data from structured data stores into HBase
- Moves analytic results out of Hadoop to a structured data store

How Sqoop imports work (PUSH / PULL):
- Interrogates the RDBMS data dictionary for the target schema
- Uses MapReduce to import data into Hadoop
- Parallel operation: configurable
- Fault tolerance: configurable
- Datatype mapping from Oracle SQL data types to Java data types (VARCHAR2 = String, INTEGER = Integer, etc.)
- Generates a Java class for the structured schema: a bean with "get" methods plus read and write methods:
      public void readFields(ResultSet __dbResults) throws SQLException;
      public void write(PreparedStatement __dbStmt) throws SQLException;

$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen             Generate code to interact with database records
  create-hive-table   Import a table definition into Hive
  eval                Evaluate a SQL statement and display the results
  export              Export an HDFS directory to a database table
  help                List available commands
  import              Import a table from a database to HDFS
  import-all-tables   Import tables from a database to HDFS
  job                 Work with saved jobs
  list-databases      List available databases on a server
  list-tables         List available tables in a database
  merge               Merge results of incremental imports
  metastore           Run a standalone Sqoop metastore
  version             Display version information

$ sqoop list-databases --connect "jdbc:mysql://localhost" --username steve --password changeme
14/04/24 15:35:21 INFO manager.SqlManager: Using default fetchSize of 1000
netcents2
dt_site
exam_module
team_wiki
$

Sqoop connectors (PUSH / PULL):
- Generic JDBC connector
- Connectors for major RDBMSs: Oracle, MySQL, SQL Server, DB2, PostgreSQL
- Third party: Teradata, Netezza, Couchbase
- Third-party connectors may support direct import into third-party data stores

Sqoop documentation (PUSH / PULL):
- User guides: Installation, Upgrade, Five Minute Demo, Command Line Client
- Developer guides: Building Sqoop2, Development Environment Setup, Java Client API Guide, Developing a Connector, REST API Guide

Oracle Loader for Hadoop (PULL):
- Essentially "SQL*Loader" for Hadoop
- A Java MapReduce application
- Runs as a Hadoop utility with a configuration file
- Extensible (attention, Java programmers)
- Command line or standalone process; online and offline modes
- Requires an existing target table (use a staging table!)
- Loads data only; cannot edit Hadoop data
- Pre-partitions data if necessary
- Can pre-sort data by primary key or user-specified columns before loading
- Leverages Hadoop's parallel processing

[Diagram: mappers read from HDFS, a reducer presorts the data, and the load travels over JDBC into the RDBMS.]

Oracle Loader advantages:
- Regular expressions (vs. as-is delimited file import)
- Faster throughput (vs.
Sqoop JDBC)
- Data dictionary interrogation during load
- Support for runtime rollback (Sqoop generates INSERT statements with no rollback support)

Sqoop advantages:
- One system for bi-directional transfer support

Oracle SQL Connector for HDFS (PULL):
- Essentially the "external table" feature for Hadoop
- Text files only; no binary file support
- Treats HDFS as an external table
- Read only (no import / transfer); no indexing; no INSERT, UPDATE, or DELETE
- As-is data import; every query is a full table scan

CREATE TABLE CUSTOMER_LOGFILES
( LOGFILE_ID   NUMBER(20)
, LOG_NOTE     VARCHAR2(120)
, LOG_DATE     DATE )
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY log_file_dir
  ACCESS PARAMETERS
  ( records delimited by newline
    badfile log_file_dir:'log_bad'
    logfile log_file_dir:'log_records'
    fields terminated by ','
    missing field values are null
    ( LOGFILE_ID
    , LOG_NOTE
    , LOG_DATE char date_format date mask "dd-mon-yyyy"
    )
  )
  LOCATION ( 'log_data_1.csv', 'log_data_2.csv' )
)
PARALLEL
REJECT LIMIT UNLIMITED;

Oracle to CDH via Sqoop (PULL):
- Freeware plug-in to CDH (Cloudera Distribution including Apache Hadoop)
- Java command-line utility
- Saves Hive HQL output to an Oracle database

Summary: there is no one best solution.
- Apache Sqoop: bi-directional; read / write / insert / update / delete; limitation: JDBC and the available connectors
- Java APIs: require knowledge of Java
- Oracle Loader for Hadoop: offers preprocessing and speed; unidirectional
- Oracle SQL Connector for HDFS: integrates with existing SQL calls; limited to HDFS text files
- Third-party tools (Cloudera, Hortonworks, etc.)
are adding features to Hadoop that may reduce demand for moving data back to Oracle.

Steve O'Hearn
[email protected]
DatabaseTraining.com and Corbinian.com
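Appendix: the HDFS block-size and replication settings discussed in the session are ordinarily set in the cluster's hdfs-site.xml. A minimal illustrative fragment follows; note that the property name dfs.blocksize is the Hadoop 2.x spelling (older releases used dfs.block.size), and the values shown are examples, not recommendations.

```xml
<!-- hdfs-site.xml: illustrative HDFS settings -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <!-- 128 MB, one of the typical block sizes mentioned in the session -->
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <!-- the 3-copy default -->
    <value>3</value>
  </property>
</configuration>
```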
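The PUSH technique described earlier uses Java string concatenation to turn SELECT results into delimited text destined for HDFS. Below is a minimal sketch of that concatenation step; the class name and the CSV-style quoting rules are this author's assumptions for illustration, not part of the original presentation, and in real use the field values would come from a JDBC ResultSet.

```java
// Sketch: turning row values into one comma-delimited line of HDFS text.
public class DelimitedLineBuilder {

    // Join field values with commas, quoting any field that contains
    // a comma, a double quote, or a newline (CSV-style escaping).
    public static String toCsvLine(String... fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) line.append(',');
            line.append(escape(fields[i]));
        }
        return line.toString();
    }

    private static String escape(String field) {
        if (field == null) return "";  // represent SQL NULL as an empty field
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        // In the real technique these values would be read from a ResultSet.
        System.out.println(toCsvLine("1001", "O'Hearn, Steve", "15-APR-2014"));
    }
}
```

The escaping matters because a name like "O'Hearn, Steve" contains the delimiter; naive concatenation would silently corrupt the file's column structure.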
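To complement the sqoop list-databases example shown earlier, here is a sketch of a Sqoop import and a matching export. The connection string, credentials, table, and directory names are placeholders; the flags shown are standard Sqoop 1 command-line options.

```shell
# Hypothetical: import the EMPLOYEES table from an Oracle schema into HDFS,
# splitting the work across four parallel mappers.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username scott --password changeme \
  --table EMPLOYEES \
  --target-dir /user/scott/employees \
  --num-mappers 4

# Hypothetical: export MapReduce results from HDFS back into an
# existing Oracle table for analysis and reporting.
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username scott --password changeme \
  --table EMP_RESULTS \
  --export-dir /user/scott/results
```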
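Because Oracle SQL Connector for HDFS exposes HDFS files through the external-table mechanism, a table like the CUSTOMER_LOGFILES example above can be queried with ordinary SQL, bearing in mind that with no indexing every query is a full scan of the underlying files. A small illustrative query (the date filter is invented for the example):

```sql
-- Read HDFS log data through the external table definition.
SELECT log_note, log_date
FROM   customer_logfiles
WHERE  log_date >= DATE '2014-01-01';
```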