Sqoop HCatalog Integration


Sqoop HCatalog Integration
Venkat Ranganathan
Sqoop Meetup, 10/28/13

Agenda

• HCatalog Overview
• Sqoop HCatalog Integration Goals
• Features
• Demo
• Benefits

HCatalog Overview

• Table and storage management service for Hadoop
  – Enables Pig/MR and Hive to more easily share data on the grid
• Uses the Hive metastore.

• Abstracts the location and format of the data.
• Supports reading and writing files in any format for which a Hive SerDe is available.

• Now part of Hive.

Sqoop HCatalog Integration Goals

• Support HCatalog features consistent with Sqoop usage:
  – Support both imports into and exports from HCatalog tables
  – Enable Sqoop to read and write data in various formats
  – Automatic table schema mapping
  – Data fidelity
  – Support for static and dynamic partition keys

Support imports and exports

• Allows an HCatalog table to be either the source or the destination of a Sqoop job.
• In an HCatalog import, --target-dir and --warehouse-dir are replaced with the HCatalog table name.
• Similarly, for exports, the export directory is replaced with the HCatalog table name.
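The slide above can be sketched as a pair of command lines. The JDBC URL, credentials, and table names below are hypothetical placeholders, not from the talk; the point is that --hcatalog-table takes the place of the HDFS directory options:

```shell
# Import: the HCatalog table name replaces --target-dir / --warehouse-dir.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop --password-file /user/sqoop/.pw \
  --table ORDERS \
  --hcatalog-table orders

# Export: the HCatalog table name replaces --export-dir.
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop --password-file /user/sqoop/.pw \
  --table ORDERS_SUMMARY \
  --hcatalog-table orders_summary
```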

File format support

• HCatalog integration now enables Sqoop to:
  – Import/export files of any format for which a Hive SerDe has been created
  – Text files, SequenceFiles, RCFile, ORCFile, …
• This makes Sqoop agnostic of the file format used, which can change over time based on new innovations/needs.

Automatic table schema mapping

• Sqoop allows a Hive table to be created based on the enterprise data store's schema.
• This is enabled for HCatalog table imports as well.
• Automatic mapping with optional user overrides.
• Ability to provide storage options for the newly created table.
• All HCatalog primitive types are supported.
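As a sketch of the slide above (connection string, table, and column names are hypothetical): --create-hcatalog-table derives the HCatalog schema from the database table, --hcatalog-storage-stanza supplies the storage options for the newly created table, and --map-column-hive overrides one automatic type mapping:

```shell
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop --password-file /user/sqoop/.pw \
  --table CUSTOMERS \
  --hcatalog-table customers \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as rcfile" \
  --map-column-hive AMOUNT=decimal
```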

Data fidelity

• With text-based imports (as with the Sqoop --hive-import option), text values have to be massaged so that delimiters are not misinterpreted.

• Sqoop provides two options to handle this:
  --hive-delims-replacement
  --hive-drop-import-delims
• Error prone, and the data is modified before being stored in Hive.

Data fidelity

• With HCatalog table imports to file formats like RCFile, ORCFile, etc., there is no need to strip these delimiters from column values.
• Data is preserved without any massaging.
• If the target HCatalog table's file format is text, the two options can still be used as before:
  --hive-delims-replacement
  --hive-drop-import-delims
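For a text-format target table, the contrast above might look like the following sketch (hypothetical connection string and table names); with a binary target format such as RCFile, neither delimiter option is needed and column values are preserved exactly:

```shell
# Text-format target: delimiters inside column values must still be
# handled, e.g. replaced with a space (so the data is modified).
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table NOTES \
  --hcatalog-table notes_text \
  --hive-delims-replacement ' '
```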

Support for static and dynamic partitioning

• HCatalog table partition keys can be dynamic or static.

• Static partitioning keys have their values provided as part of the DML (known at query compile time).
• Dynamic partitioning keys have their values provided at execution time.
  – Based on the value of a column being imported

Support for static and dynamic partitioning

• Both types of partition keys are supported during import.
• Multiple partition keys per table are supported.
• Only one static partition key can be specified (a Sqoop restriction).
• Only tables with a single partitioning key can be automatically created.
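A static partition key can be sketched as below (connection string and table/column names are hypothetical): every imported row lands in the country='US' partition, while any other partition columns of the table would be filled dynamically from the imported data:

```shell
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table ORDERS \
  --hcatalog-table orders_by_country \
  --hive-partition-key country \
  --hive-partition-value US
```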

Benefits

• Future-proof your Sqoop jobs by making them agnostic of the file formats used.
• Remove additional steps before taking data to the target table format.
• Preserve data contents.

Availability & Documentation

• Part of the Sqoop 1.4.4 release.
• A chapter devoted to HCatalog integration in the User Guide.
• URL: https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_sqoop_hcatalog_integration

© Hortonworks Inc. 2013

DEMO

Questions?
