Transcript Document
Copyright (c) 2014 Steve O'Hearn
http://www.databasetraining.com
1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop
2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS
3. Understand the purpose, benefits, and limitations of the various techniques and tools
What is the Oracle RDBMS?
What is Hadoop?
Source: Oracle Database Concepts, 12c Release 1 (12.1)
Framework of tools
Open source - Java
Apache Software Foundation projects
Several tools
HDFS (storage) and MapReduce (analysis)
HBase, Hive, Pig, Sqoop, Flume, more
Network sockets
Hadoop Distributed File System
Text files, binary files
Very large data blocks
  64MB minimum, 1GB or higher possible
  Typical is 128MB or 256MB
Replication: 3-copy default
Namenode and Datanodes
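As a rough back-of-the-envelope sketch of how block size and replication multiply out (using the 1GB, 128MB, and 3-copy figures above; the class and method names are invented):

```java
class HdfsBlockMath {
    // ceil(fileBytes / blockBytes): number of HDFS blocks a file occupies
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;        // 1GB file
        long blockSize = 128L << 20;  // typical 128MB block
        long blocks = blockCount(oneGb, blockSize);
        // default replication factor of 3
        System.out.println(blocks + " blocks x 3 replicas = " + (blocks * 3) + " stored copies");
    }
}
```

This prints "8 blocks x 3 replicas = 24 stored copies": even a modest file fans out across many datanodes.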
Analytical Engine of Hadoop
JobTracker
TaskTracker
Interprets data at runtime instead of using predefined schemas
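A toy illustration of that schema-on-read idea, with nothing Hadoop-specific about it: structure is imposed on the raw line only at the moment it is read, and the comma-delimited field layout here is invented, not declared anywhere up front:

```java
class SchemaOnRead {
    // The raw line carries no declared schema; structure is imposed at read time.
    static String[] parse(String rawLine) {
        return rawLine.split(",");
    }

    public static void main(String[] args) {
        String[] fields = parse("101,startup complete,01-JAN-2014");
        System.out.println(fields.length + " fields, first = " + fields[0]);
    }
}
```

This prints "3 fields, first = 101"; an RDBMS would instead have validated those fields against a schema at write time.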
(Diagram: the Hadoop stack. Hive, Pig, HBase, and other tools sit atop MapReduce and the HDFS API, which in turn sit atop HDFS, which stores text files and binary files.)
NOTE: There are file systems other than HDFS.
            Oracle RDBMS               Hadoop
Integrity   High                       Low
Schema      Structured                 Unstructured
Use         Frequent reads and writes  Write once, read many
Style       Interactive and batch      Batch
1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop
2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS
3. Understand the purpose, benefits, and limitations of the various techniques and tools
Sample scenario:
  Move data from Oracle into Hadoop
  Perform MapReduce on datasets that include Oracle data
  Move MapReduce results back into Oracle for analysis, reporting, etc.
Other uses:
  Oracle queries that join with Hadoop datasets
  Scheduled batch MapReduce results to be warehoused
Oracle and Hadoop together form a comprehensive platform for managing all forms of data, both structured and unstructured. Hadoop provides “big data” processing. Oracle provides analytic capabilities not found in Hadoop. (NOTE: This is changing.)
1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop
2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS
3. Understand the purpose, benefits, and limitations of the various techniques and tools
(Diagram: moving data from Oracle into Hadoop)
PUSH: SELECT via SQL, PL/SQL, or Java program units (such as stored procedures, etc.)
PULL: custom Java with JDBC, DBInputFormat, FileSystem API, Avro; or Sqoop
(Diagram: moving data from Hadoop back into Oracle)
PUSH: custom Java with JDBC, DBOutputFormat, FileSystem API, Avro; or Sqoop
PULL: Oracle Loader for Hadoop; Oracle SQL Connector for HDFS
SQL’s SELECT statement
Use Java to:
  Spool output
  Control output
Use string concatenation to create delimited text HDFS files
Use the Java Avro API to create serialized binary HDFS files
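A minimal sketch of the string-concatenation step alone, assuming the rows have already been fetched; a real push would iterate a JDBC ResultSet and write the lines out through the HDFS FileSystem API (the class name and sample values here are invented):

```java
import java.util.List;

class DelimitedSpooler {
    // Joins one row's column values into a delimited line suitable for an HDFS text file.
    static String toDelimitedLine(List<String> columns, char delimiter) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.size(); i++) {
            if (i > 0) sb.append(delimiter);
            sb.append(columns.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One spooled row, as it would appear in the delimited text file
        System.out.println(toDelimitedLine(List.of("101", "startup complete", "01-JAN-2014"), ','));
    }
}
```

This prints the line 101,startup complete,01-JAN-2014 — one record of a delimited file Hadoop can consume as-is.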
PUSH / PULL
To connect to the RDBMS: JDBC
Interacts with the RDBMS:
  DBInputFormat: reading from a database
  DBOutputFormat: dumping to a database
  Generates SQL
  Best for smaller amounts of data
  org.apache.hadoop.mapreduce.lib.db
To interact with HDFS files:
  FileSystem API
  Avro API (for binary files)
PUSH / PULL
Sqoop = “SQL to Hadoop”
Command line
Works with any JDBC-compliant RDBMS
Works with any external system that supports bulk data transfer into Hadoop (HDFS, HBase, Hive)
Strength: transfer of bulk data between Hadoop and RDBMS environments
Read / Write / Update / Insert / Delete
Stored procedures (warning: parallel processing)
PUSH / PULL
Open source / Java
Apache Top-Level Project (graduated from incubator level March 2012)
Bundled with:
  Oracle Big Data Appliance
  CDH (Cloudera Distribution including Apache Hadoop)
Also available at the Apache Software Foundation: http://sqoop.apache.org
Latest version of Sqoop2: 1.99.3 (as of 4/15/14)
Wiki: https://cwiki.apache.org/confluence/display/SQOOP
PUSH / PULL
Text
  Human-readable
Binary
  Precision
  Compression
  Examples: SequenceFile (Java-specific), Avro
Note: Sqoop cannot currently load SequenceFile or Avro into Hive.
PUSH / PULL
Interacts with structured data stores outside of HDFS
Moves data from structured data stores into HBase
Moves analytic results out of Hadoop to a structured data store
PUSH / PULL
Interrogates the RDBMS data dictionary for the target schema
Uses MapReduce to import data into Hadoop
Parallel operation: configurable
Fault tolerance: configurable
Datatype mapping: Oracle SQL data types to Java data types (VARCHAR2 = String; INTEGER = Integer, etc.)
Generates a Java class for the structured schema
  Bean: “get” methods
  Write methods:
    public void readFields(ResultSet __dbResults) throws SQLException;
    public void write(PreparedStatement __dbStmt) throws SQLException;
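To make those signatures concrete, here is a hand-written sketch of the kind of class Sqoop generates for a hypothetical two-column table (ID INTEGER, NAME VARCHAR2); real generated classes also implement Hadoop’s Writable/DBWritable interfaces and handle SQL NULLs, both omitted here:

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class EmployeeRecord {
    private Integer id;   // INTEGER  -> Integer
    private String name;  // VARCHAR2 -> String

    public Integer get_id() { return id; }
    public String get_name() { return name; }

    // Populate the bean from one row of a JDBC result set (used on import)
    public void readFields(ResultSet __dbResults) throws SQLException {
        this.id = __dbResults.getInt(1);
        this.name = __dbResults.getString(2);
    }

    // Bind the bean's fields into an INSERT statement (used on export)
    public void write(PreparedStatement __dbStmt) throws SQLException {
        __dbStmt.setInt(1, this.id);
        __dbStmt.setString(2, this.name);
    }

    public static void main(String[] args) {
        EmployeeRecord r = new EmployeeRecord();
        System.out.println("id=" + r.get_id() + " name=" + r.get_name());
    }
}
```

MapReduce tasks move these beans between the database and HDFS; the getters expose the columns to map logic.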
PUSH / PULL
$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
PUSH / PULL
$ sqoop list-databases --connect "jdbc:mysql://localhost" --username steve --password changeme
14/04/24 15:35:21 INFO manager.SqlManager: Using default fetchSize of 1000
netcents2
dt_site
exam_module
team_wiki
$
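A table import follows the same pattern; the following sketch pulls an Oracle table into HDFS as comma-delimited text. The host, service name, user, password file, table, and target directory are all placeholders, and running it requires a Hadoop client with Sqoop and the Oracle JDBC driver on its classpath:

```shell
# Illustrative placeholders throughout -- adjust connect string, credentials, and paths
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username scott \
  --password-file /user/scott/.oracle.pw \
  --table EMPLOYEES \
  --target-dir /user/scott/employees \
  --fields-terminated-by ',' \
  --num-mappers 4
```

Each mapper writes one part-file under the target directory; --num-mappers controls the configurable parallel operation mentioned earlier.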
PUSH / PULL
Generic JDBC Connector
Connectors for major RDBMSs: Oracle, MySQL, SQL Server, DB2, PostgreSQL
Third party: Teradata, Netezza, Couchbase
Third-party connectors may support direct import into third-party data stores
PUSH / PULL
User Guides
Installation
Upgrade
Five Minute Demo
Command Line Client
Developer Guides
Building Sqoop2
Development Environment Setup
Java Client API Guide
Developing Connector
REST API Guide
PULL
Essentially “SQL*Loader” for Hadoop
Java MapReduce application
Runs as a Hadoop utility with a configuration file
Extensible (attention, Java programmers)
Command line or standalone process
Online and offline modes
Requires an existing target table (staging table!)
Loads data only; cannot edit Hadoop data
Pre-partitions data if necessary
Can pre-sort data by primary key or user-specified columns before loading
Leverages Hadoop’s parallel processing
PULL
(Diagram: Oracle Loader for Hadoop data flow. Mappers read from HDFS; a presorting reducer delivers rows to the RDBMS over JDBC.)
Oracle Loader advantages:
  Regular expressions (vs. as-is delimited file import)
  Faster throughput (vs. Sqoop JDBC)
  Data dictionary interrogation during load
  Support for runtime rollback (Sqoop generates INSERT statements with no rollback support)
Sqoop advantages:
  One system for bi-directional transfer support
PULL
Essentially the “external table” feature for Hadoop
Text files only; no binary file support
Treats HDFS as an external table
Read only (no import / transfer)
No indexing
No INSERT, UPDATE, or DELETE
As-is data import
Full table scan
CREATE TABLE CUSTOMER_LOGFILES
( LOGFILE_ID  NUMBER(20)
, LOG_NOTE    VARCHAR2(120)
, LOG_DATE    DATE )
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY log_file_dir
  ACCESS PARAMETERS
  ( records delimited by newline
    badfile log_file_dir:'log_bad'
    logfile log_file_dir:'log_records'
    fields terminated by ','
    missing field values are null
    ( LOGFILE_ID
    , LOG_NOTE
    , LOG_DATE char date_format date mask "dd-mon-yyyy"
    )
  )
  LOCATION ( 'log_data_1.csv'
           , 'log_data_2.csv' )
)
PARALLEL
REJECT LIMIT UNLIMITED;
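Once defined, the external table can be queried like any other Oracle table. A hypothetical join against an ordinary table follows; the CUSTOMERS table and its columns are invented for illustration:

```sql
-- Hypothetical: CUSTOMERS is a regular Oracle table; CUSTOMER_LOGFILES reads flat files
SELECT c.customer_name, l.log_note, l.log_date
  FROM customers c
  JOIN customer_logfiles l
    ON l.logfile_id = c.customer_id
 WHERE l.log_date > SYSDATE - 7;
```

Note that every such query re-reads the underlying files in full; external tables have no indexes.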
PULL
Oracle to CDH via Sqoop
Freeware plug-in to CDH (Cloudera Distribution including Apache Hadoop)
Java command-line utility
Saves Hive HQL output to an Oracle database
There is no one best solution
Apache Sqoop and Java APIs
  Bi-directional
  Read / Write / Insert / Update / Delete
  Limitation: JDBC and available connectors
  Requires knowledge of Java
Oracle Loader
  Offers preprocessing and speed
  Unidirectional
Oracle SQL Connector
  Integrates with existing SQL calls
  Limited to HDFS text files
Third-party tools (Cloudera, Hortonworks, etc.) are adding features to Hadoop that may reduce demand for moving data back to Oracle
Steve O’Hearn
[email protected]
DatabaseTraining.com
and
Corbinian.com