Transcript Document

Copyright (c) 2014 Steve O'Hearn
http://www.databasetraining.com
1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop
2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS
3. Understand the purpose, benefits, and limitations of the various techniques and tools
- What is the Oracle RDBMS?
- What is Hadoop?
Source: Oracle Database Concepts, 12c Release 1 (12.1)
- Framework of tools
- Open source, written in Java
- Apache Software Foundation projects
- Several tools
  - HDFS (storage) and MapReduce (analysis)
  - HBase, Hive, Pig, Sqoop, Flume, and more
- Network sockets
- Hadoop Distributed File System
- Text files, binary files
- Very large data blocks
  - 64MB minimum
  - 1GB or higher possible
  - Typical is 128MB or 256MB
- Replication: 3-copy default
- Namenode and Datanodes
- Analytical engine of Hadoop
- JobTracker
- TaskTracker
- Interprets data at runtime instead of using predefined schemas
[Diagram: the Hadoop stack. Hive, Pig, HBase, and other tools run on MapReduce; MapReduce accesses storage through the HDFS API; HDFS holds text files and binary files.]
NOTE: There are file systems other than HDFS.
            Oracle RDBMS               Hadoop
Integrity   High                       Low
Schema      Structured                 Unstructured
Use         Frequent reads and writes  Write once, read many
Style       Interactive and batch      Batch
1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop
2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS
3. Understand the purpose, benefits, and limitations of the various techniques and tools
Sample scenario:
- Move data from Oracle into Hadoop
- Perform MapReduce on datasets that include Oracle data
- Move MapReduce results back into Oracle for analysis, reporting, etc.

Other uses:
- Oracle queries that join with Hadoop datasets
- Scheduled batch MapReduce results to be warehoused
- Oracle and Hadoop together form a comprehensive platform for managing all forms of data, both structured and unstructured.
- Hadoop provides “big data” processing.
- Oracle provides analytic capabilities not found in Hadoop. (NOTE: This is changing.)
1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop
2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS
3. Understand the purpose, benefits, and limitations of the various techniques and tools
Moving data from Oracle into Hadoop:

PUSH (Oracle pushes the data out):
- SQL SELECT statement
- SQL, PL/SQL, Java program units (such as stored procedures, etc.)

PULL (Hadoop pulls the data in):
- Custom Java w/JDBC, DBInputFormat, FileSystem API, Avro
- Sqoop
Moving data from Hadoop into Oracle:

PUSH (Hadoop pushes the data out):
- Custom Java w/JDBC, DBOutputFormat, FileSystem API, Avro
- Sqoop

PULL (Oracle pulls the data in):
- Oracle Loader for Hadoop
- Oracle SQL Connector for HDFS
- SQL’s SELECT statement
- Use Java to:
  - Spool output
  - Control output
  - Use string concatenation to create delimited-text HDFS files
  - Use the Java Avro API to create serialized binary HDFS files
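The string-concatenation approach above can be sketched in plain Java. This is an illustrative sketch only; the class, method, and sample values are not from any Oracle or Hadoop API, and in a real push the column values would come from a JDBC ResultSet loop.

```java
// Sketch: build one delimited-text line (for an HDFS file) from column
// values pulled out of a query result. Names are illustrative.
public class DelimitedLine {

    // Join column values with a delimiter, naively quoting any value
    // that contains the delimiter so the line stays parseable.
    static String toLine(String delimiter, String... columns) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) line.append(delimiter);
            String col = columns[i];
            if (col.contains(delimiter)) {
                col = "\"" + col + "\"";  // embedded delimiter: wrap in quotes
            }
            line.append(col);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // In a real spool these would be rs.getString(n) calls inside a
        // loop over a JDBC ResultSet.
        System.out.println(toLine(",", "1001", "server restart", "24-APR-2014"));
    }
}
```

Each formatted line would then be written to an HDFS file, e.g. via UTL_FILE on the Oracle side or a Java writer.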
To connect to the RDBMS:
- JDBC

To interact with the RDBMS:
- DBInputFormat: reading from a database
- DBOutputFormat: dumping to a database
  - Generates SQL
  - Best for smaller amounts of data
  - Package: org.apache.hadoop.mapreduce.lib.db

To interact with HDFS files:
- FileSystem API
- Avro API (for binary files)
- Sqoop = “SQL to Hadoop”
- Command-line tool
- Works with any JDBC-compliant RDBMS
- Works with any external system that supports bulk data transfer into Hadoop (HDFS, HBase, Hive)
- Strength: transfer of bulk data between Hadoop and RDBMS environments
- Read / Write / Update / Insert / Delete
- Stored procedures (warning: parallel processing)
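A typical bulk import looks like the command below. This is an illustrative sketch: the connect string, credentials, table name, and target directory are made up; the flags themselves (--connect, --username, --password, --table, --target-dir, --fields-terminated-by, -m) are standard Sqoop 1 import options.

```shell
# Import one Oracle table into HDFS as comma-delimited text,
# using 4 parallel map tasks.
sqoop import \
  --connect "jdbc:oracle:thin:@//dbhost:1521/orcl" \
  --username steve --password changeme \
  --table CUSTOMER_LOGFILES \
  --target-dir /user/steve/customer_logfiles \
  --fields-terminated-by ',' \
  -m 4
```

`sqoop export` reverses the direction, reading an HDFS directory and generating INSERTs against the target table.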
- Open source / Java
- Apache Top-Level Project (graduated from incubator level March 2012)
- Bundled with:
  - Oracle Big Data Appliance
  - CDH (Cloudera Distribution including Apache Hadoop)
- Also available from the Apache Software Foundation: http://sqoop.apache.org
- Latest version of Sqoop2: 1.99.3 (as of 4/15/14)
- Wiki: https://cwiki.apache.org/confluence/display/SQOOP
- Text
  - Human-readable
- Binary
  - Precision
  - Compression
  - Examples: SequenceFile (Java-specific), Avro

Note: Sqoop cannot currently load SequenceFile or Avro into Hive.
- Interacts with structured data stores outside of HDFS
- Moves data from structured data stores into HBase
- Moves analytic results out of Hadoop to a structured data store
- Interrogates the RDBMS data dictionary for the target schema
- Uses MapReduce to import data into Hadoop
  - Parallel operation: configurable
  - Fault tolerance: configurable
- Datatype mapping: Oracle SQL data types to Java data types (VARCHAR2 = String; INTEGER = Integer, etc.)
- Generates a Java class for the structured schema
  - Bean-style “get” methods
  - Write methods:
    public void readFields(ResultSet __dbResults) throws SQLException;
    public void write(PreparedStatement __dbStmt) throws SQLException;
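A sketch of the kind of class Sqoop’s codegen emits, assuming a table with columns (LOGFILE_ID, LOG_NOTE). The field names, the CSV toString, and the demo main are illustrative, not Sqoop’s exact output; only the readFields/write signatures mirror the ones shown above.

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical generated record class for a two-column table.
public class LogRecord {
    private Integer logfileId;
    private String logNote;

    // Bean-style "get" methods
    public Integer getLogfileId() { return logfileId; }
    public String getLogNote()    { return logNote; }

    // Populate the record from one JDBC result row
    public void readFields(ResultSet dbResults) throws SQLException {
        this.logfileId = dbResults.getInt(1);
        this.logNote   = dbResults.getString(2);
    }

    // Bind the record's fields to a parameterized INSERT
    public void write(PreparedStatement dbStmt) throws SQLException {
        dbStmt.setInt(1, logfileId);
        dbStmt.setString(2, logNote);
    }

    @Override
    public String toString() {  // delimited-text form, as written to HDFS
        return logfileId + "," + logNote;
    }

    public static void main(String[] args) {
        // Demo without a live database: set fields directly and print
        // the delimited form (readFields/write need a real connection).
        LogRecord r = new LogRecord();
        r.logfileId = 42;
        r.logNote = "startup";
        System.out.println(r);
    }
}
```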
$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
$ sqoop list-databases --connect "jdbc:mysql://localhost" --username steve \
    --password changeme
14/04/24 15:35:21 INFO manager.SqlManager: Using default fetchSize of 1000
netcents2
dt_site
exam_module
team_wiki
$
- Generic JDBC connector
- Connectors for major RDBMSs: Oracle, MySQL, SQL Server, DB2, PostgreSQL
- Third party: Teradata, Netezza, Couchbase
- Third-party connectors may support direct import into third-party data stores
User Guides:
- Installation
- Upgrade
- Five Minute Demo
- Command Line Client

Developer Guides:
- Building Sqoop2
- Development Environment Setup
- Java Client API Guide
- Developing Connector
- REST API Guide
- Essentially, “SQL*Loader” for Hadoop
- Java MapReduce application
- Runs as a Hadoop utility with a configuration file
- Extensible (attention, Java programmers)
- Command line or standalone process
- Online and offline modes
- Requires an existing target table (staging table!)
- Loads data only; cannot edit Hadoop data
- Pre-partitions data if necessary
- Can pre-sort data by primary key or user-specified columns before loading
- Leverages Hadoop’s parallel processing
[Diagram: Oracle Loader for Hadoop data flow. Mappers read data from HDFS; a reducer performs presorting; the sorted output is loaded into the RDBMS via JDBC.]
Oracle Loader advantages:
- Regular expressions (vs. as-is delimited file import)
- Faster throughput (vs. Sqoop JDBC)
- Data dictionary interrogation during load
- Support for runtime rollback (Sqoop generates INSERT statements with no rollback support)

Sqoop advantages:
- One system for bi-directional transfer support
- Essentially, the “external table” feature for Hadoop
- Text files only; no binary file support
- Treats HDFS as an external table
- Read-only (no import / transfer)
- No indexing
- No INSERT, UPDATE, or DELETE
- As-is data import
- Full table scan
CREATE TABLE CUSTOMER_LOGFILES
( LOGFILE_ID  NUMBER(20)
, LOG_NOTE    VARCHAR2(120)
, LOG_DATE    DATE)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY log_file_dir
  ACCESS PARAMETERS
  ( records delimited by newline
    badfile log_file_dir:'log_bad'
    logfile log_file_dir:'log_records'
    fields terminated by ','
    missing field values are null
    ( LOGFILE_ID
    , LOG_NOTE
    , LOG_DATE char date_format date mask "dd-mon-yyyy"
    )
  )
  LOCATION ( 'log_data_1.csv'
           , 'log_data_2.csv')
)
PARALLEL
REJECT LIMIT UNLIMITED;
- Oracle to CDH via Sqoop
- Freeware plug-in to CDH (Cloudera Distribution including Apache Hadoop)
- Java command-line utility
- Saves Hive HQL output to an Oracle database
- There is no one best solution
- Apache Sqoop and Java APIs
  - Bi-directional
  - Read / Write / Insert / Update / Delete
  - Limitation: JDBC and available connectors
  - Requires knowledge of Java
- Oracle Loader
  - Offers preprocessing and speed
  - Unidirectional
- Oracle SQL Connector
  - Integrates with existing SQL calls
  - Limited to HDFS text files
- Third-party tools (Cloudera, Hortonworks, etc.) are adding features to Hadoop that may reduce demand for moving data back to Oracle
Steve O’Hearn
[email protected]
DatabaseTraining.com
and
Corbinian.com