MSBIC Hadoop Series Querying Data with Hive



MSBIC Hadoop Series
Querying Data with Hive
Bryan Smith
email: [email protected]
twitter: @smithbryanc
MSBIC Hadoop Series
http://msbic.sqlpass.org/
Learn the basics of Hadoop through a combination of demonstration and lecture.
Session participants are invited to follow along leveraging emulation environments
and Azure-based clusters, the setting up of which we will address in our first session.
March – Getting Started
April – Understanding the File System
May – Implementing MapReduce Jobs
June – Querying the Data with Hive
July – On Vacation
August – Processing the Data with Pig
September – Hadoop & MS BI
October – To Be Announced
November – Loading Social Media Data
December – DW Integration
Today’s Session
Objectives:
1. Understand the basics of Hive
2. Demonstrate use of Hive with sample data set
The Hive Data Warehouse
Hive translates a HiveQL statement into one or more MapReduce jobs. For example, this query compiles to a single MapReduce job:
SELECT MyCol, COUNT(*)
FROM MyTable
GROUP BY MyCol;
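To see the plan Hive generates for a statement, prefix it with EXPLAIN; Hive prints the stages (including the MapReduce stages) it will execute. A quick sketch using the example query above:
-- show the execution plan rather than running the query
explain
select MyCol, count(*)
from MyTable
group by MyCol;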
Demonstration
Demo Script 1: Create Database
show databases;
create database ufo location '/ufo.db';
dfs -ls /;
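As an optional sanity check (not part of the original script), DESCRIBE DATABASE reports the HDFS location assigned above:
-- should report location /ufo.db
describe database ufo;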
Demo Script 2: Create & Load Table
use ufo;
create table sightings (
dateobs string, daterpt string, `location` string,
shape string, duration string, `description` string)
row format delimited
fields terminated by '\t';
load data inpath '/demo/ufo/in/ufo_awesome.tsv'
overwrite into table sightings;
dfs -ls /ufo.db;
dfs -ls /ufo.db/sightings;
dfs -ls /demo/ufo/in;
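Another optional check: DESCRIBE FORMATTED prints the table's metadata, including the storage location under /ufo.db that the loaded file was moved into:
-- inspect the table's location, SerDe, and other properties
describe formatted sightings;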
Demo Script 3: Query Table
select * from sightings limit 10;
select
substring(dateobs, 0, 4) as year,
shape,
count(*)
from sightings
group by substring(dateobs, 0, 4), shape;
create table SightingsSummary as
select
substring(dateobs, 0, 4) as year,
shape,
count(*)
from sightings
group by substring(dateobs, 0, 4), shape;
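Once created, the summary table can be queried like any other. A hypothetical follow-up, not in the original script:
-- browse the summarized sightings by year
select * from SightingsSummary
order by year
limit 20;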
Managed vs. External Tables
Managed Tables:
Table definition & associated data files are managed by Hive
Loaded data files are moved into the table-associated folders
Dropping the table drops the data files
Use for transformed data needed only by Hive
External Tables:
Only the table definition is managed by Hive
Loaded data files remain in their original location
Dropping the table does not drop the data files
Use for initial staging or when data needs to be accessible across a wide range of applications
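For contrast, here is a minimal sketch of an external-table version of the demo table (the table name is hypothetical), which leaves the data files where they sit in /demo/ufo/in:
create external table sightings_ext (
dateobs string, daterpt string, `location` string,
shape string, duration string, `description` string)
row format delimited
fields terminated by '\t'
location '/demo/ufo/in';
-- dropping the table removes only the metadata; the files remain in /demo/ufo/in
drop table sightings_ext;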
File Formats
Default file format is delimited text, with one record per line
Fields default to the Ctrl-A (\001) delimiter, overridable per table (e.g., tab-delimited in the demo)
File access is controlled by LazySimpleSerDe (the default SerDe)
Built-in data types include…
int, bigint, tinyint, smallint, float, double, boolean, string, binary, timestamp
Complex structures supported with array, map & struct types
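A sketch of the complex types in a table definition, with hypothetical column names and explicit collection delimiters:
create table sightings_complex (
id int,
witnesses array<string>,               -- list of witness names
attributes map<string, string>,        -- free-form key/value pairs
coords struct<lat:double, lon:double>) -- nested structure
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';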
HCatalog
Table & storage management layer for Hadoop
Database definitions, table definitions, etc. exposed through a common interface
Integrated with Hive but accessible from Hive, Pig & MapReduce
Metadata stored by default in an Apache Derby database
Other databases can be substituted for better performance, HA, etc.
A Few Key Points
Object definitions are not case-sensitive…
But string comparisons and HDFS references are
Names that conflict with reserved keywords can be used by escaping them with backticks (grave accents)
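For example, the demo table's `location` column collides with the LOCATION keyword, so it must be backtick-escaped whenever it is referenced:
-- `location` must be escaped; unquoted, it is a reserved keyword
select `location`, count(*)
from sightings
group by `location`
limit 10;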
Reserved Keywords
Add, All, Alter, And, Array, As, Asc, Bigint, Binary, Boolean, Bucket, Buckets, By, Cast, Cluster, Clustered, Collection, Columns, Comment, Create, Data, Date, Datetime, Delimited, Desc, Describe, Directory, Distinct, Distribute, Double, Drop, Explain, Extended, External, False, Fields, Float, Format, From, Full, Function, Group, Inpath, Inputformat, Insert, Int, Into, Is, Items, Join, Keys, Left, Like, Limit, Lines, Load, Local, Location, Map, Msck, Not, Null, Of, On, Or, Order, Out, Outer, Outputformat, Overwrite, Partition, Partitioned, Partitions, Reduce, Regexp, Rename, Replace, Right, Rlike, Row, Select, Sequencefile, Serde, Serdeproperties, Set, Show, Smallint, Sort, Sorted, Stored, String, Table, Tables, Tablesample, Tblproperties, Temporary, Terminated, Textfile, Timestamp, Tinyint, To, Transform, True, Union, Using, Where, With
Resources
Today’s Session
Objectives:
1. Understand the basics of Hive
2. Demonstrate use of Hive with sample data set
For Next Session
Topic:
- Processing the Data with Pig
Requested Action(s):
- Come with a working HDInsight Emulator
- Load the sample data sets into HDFS on the Emulator