PASS2012BIA305AAllAbuzzAboutHive

Transcript PASS2012BIA305AAllAbuzzAboutHive

SQLCAT: Big Data – All
Abuzz About Hive
Cindy Gross
SQLCAT BI/Big Data PM
Microsoft
http://blogs.msdn.com/cindygross
@SQLCindy
[email protected]
Ed Katibah
SQLCAT Spatial PM
Microsoft
http://blogs.msdn.com/b/edkatibah/
@Spatial_Ed
[email protected]
November 6-9, Seattle, WA
BIG AGENDA
What’s the social
sentiment for my
brand or
products
How do I better
predict future
outcomes?
How do I optimize
my fleet based on
weather and traffic
patterns?
Increases ad revenue by processing 3.5
billion events per day
Measures and ranks online user influence
by processing 3 billion signals per day
Uses sentiment analysis and web analytics
for its internal cloud
Massive Volumes
Cloud Connectivity
Real-Time Insight
Processes 464 billion rows per quarter, with
average query time under 10 secs.
Connects across 15 social networks via the
cloud for data and API access
Improves operational decision making for
IT managers and users
MANAGE ANY DATA, ANY SIZE, ANYWHERE
010101010101010101
1010101010101010
01010101010101
101010101010
VVVVROOM!
6
BIG DATA
8
BIG DATA REQUIRES AN END-TO-END APPROACH
INSIGHT
Self-Service
Collaboration
Corporate Apps
Devices
DATA
ENRICHMENT
Discover
Combine
Refine
DATA
MANAGEMENT
Relational
Non-relational
Analytical
Streaming
Hadoop architecture.
Distributed Processing
(Map Reduce)
Distributed Storage
(HDFS)
HIVE ARCHITECTURE
Hive
Hadoop
DEMO:
Analyzing a Frankenstorm
14
November 6-9, Seattle, WA
Behind the Scenes
November 6-9, Seattle, WA15
GET HDINSIGHT
Sign up for Windows Azure HDInsight Service
http://HadoopOnAzure.com (Cloud CTP)
Download Microsoft HDInsight Server
http://microsoft.com/bigdata (On-Prem CTP)
16
CREATE TABLE
CREATE EXTERNAL TABLE censusP
(State_FIPS int,
County_FIPS int,
Population bigint,
Pop_Age_Over_69 bigint,
Total_Households bigint,
Median_Household_Income bigint,
KeyID string)
COMMENT 'US Census Data'
PARTITIONED BY (Year string)
ROW FORMAT DELIMITED FIELDS TERMINATED by '\t'
STORED AS TEXTFILE;
ALTER TABLE censusP ADD PARTITION (Year = '2010')
LOCATION '/user/demo/census/2010';
17
INSIDE A HIVE TABLE
DATA TYPES
EXTERNAL / INTERNAL
PARTITIONED BY | CLUSTERED BY | SKEWED BY
Terminators
ROW FORMAT DELIMITED | SERDE
STORED AS
FIELDS/COLLECTION ITEMS/MAP KEYS TERMINATED BY
LOCATION
18
METADATA
Metadata
is stored in a MetaStore database such as
Derby
SQL Azure
SQL Server
View
SHOW TABLES 'ce.*';
DESCRIBE census;
DESCRIBE census.population;
DESCRIBE EXTENDED census;
DESCRIBE FORMATTED census;
SHOW FUNCTIONS "x.*";
SHOW FORMATTED INDEXES ON census;
19
DATA TYPES
Primitives
Numbers: Int, SmallInt, TinyInt, BigInt, Float, Double
Characters: String
Special: Binary, Timestamp
Collections
STRUCT<City:String, State:String> | Struct (‘Boise’, ‘Idaho’)
ARRAY <String> | Array (‘Boise’, ‘Idaho’)
MAP <String, String> | Map (‘City’, ‘Boise’, ‘State’, ‘Idaho’)
UNIONTYPE <BigInt, String, Float>
Properties
No fixed lengths
NULL handling depends on SerDe
20
STORAGE – EXTERNAL AND INTERNAL
CREATE EXTERNAL TABLE census(…)
LOCATION '/user/demo/census';
LOCATION ‘hdfs:///user/demo/census';
LOCATION ‘asv://user/demo/census';
Use EXTERNAL when
Data also used outside of Hive
Data needs to remain even after a DROP TABLE
Use custom location such as ASV
Hive should not own data and control settings, directories, etc.
Not creating table based on existing table (AS SELECT)
And
ASV = Azure Storage Vault (blob store)
INTERNAL is NOT a keyword, just leave off EXTERNAL
21
STORAGE – PARTITION AND BUCKET
CREATE EXTERNAL TABLE census (…)
PARTIONED BY (Year string)
CLUSTERED BY (population) into 256 BUCKETS
Partition
Directory for each distinct combination of string partition values
Partition key name cannot be defined in table itself
Allows partition elimination
Useful in range searches
Can slow performance if partition is not referenced in query
Buckets
Split data based on hash of a column
One HDFS file per bucket within partition sub-directory
Performance may improve for aggregates and join queries
Sampling
22
STORAGE – FILE FORMATS AND SERDES
CREATE EXTERNAL TABLE census (…)
ROW FORMAT DELIMITED
FIELDS TERMINATED by ‘\001‘
STORED AS TEXTFILE, RCFILE, SEQUENCEFILE, AVRO
Format
TEXTFILE is common, useful when data is shared and all alphanumeric
Extensible storage formats via custom input, output formats
Extensible on disk/in-memory representation via custom SerDes
23
CREATE INDEX
CREATE INDEX census_population
ON TABLE census (population)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE census_population_index;
ALTER INDEX census_population ON census REBUILD;
Key Points
No keys
Index data is another table
Requires REBUILD to include new data
SHOW FORMATTED INDEXES on MyTable;
Indexing May Help
Avoid many small partitions
GROUP BY
24
CREATE VIEW
CREATE VIEW censusBigPop (state_fips, county_fips, population)
AS SELECT state_fips, county_fips, population
FROM census
WHERE population > 500000
ORDER BY population;
Sample Code
SELECT * FROM censusBigPop;
DESCRIBE FORMATTED censusBigPop;
Key Points
Not materialized
Can have ORDER BY or LIMIT
25
QUERY
SELECT c.state_fips, c.county_fips, c.population
FROM census c
WHERE c.median_household_income > 100000
GROUP BY c.state_fips, c.county_fips, c.population
ORDER BY county_fips
LIMIT 100;
Key Points
Minimal caching, statistics, or optimizer
Generally reads entire data set for every query
Performance
The order of columns, tables can make a difference to performance
Use partition elimination for range filtering
26
SORTING
ORDER BY
One reducer does final sort, can be a big bottleneck
SORT BY
Sorted only within each reducer, much faster
DISTRIBUTE BY
Determines how map data is distributed to reducers
SORT BY + DISTRIBUTE BY = CLUSTER BY
Can mimic ORDER BY, better perf if even distribution
27
JOINS
Supported Hive Join Types
Equality
OUTER - LEFT, RIGHT, FULL
LEFT SEMI
Not Supported
Non-Equality
IN/EXISTS subqueries (rewrite as LEFT SEMI JOIN)
28
JOINS
Characteristics
Multiple MapReduce jobs unless same join columns in all tables
Put largest table last in query to save memory
Joins are done left to right in query order
JOIN ON completely evaluated before WHERE starts
29
EXPLAIN
EXPLAIN SELECT * FROM census;
EXPLAIN SELECT * FROM census WHERE population > 100000;
EXPLAIN EXTENDED SELECT * FROM census;
Characteristics
Does not execute the query
Shows parsing
Lists stages, temp files, dependencies, modes, output operators, etc.
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME census))) (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
TOK_ALLCOLREF))))
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
30
CONFIGURE HIVE
Configuration
Hive default configuration <install-dir>/conf/hive-default.xml
Configuration variables <install-dir>/conf/hive-site.xml
Hive configuration directory HIVE_CONF_DIR environment variable
Log4j configuration <install-dir>/conf/hive-log4j.properties
Typical Log: c:\Hadoop\hive-0.9.0\logs\hive.log
31
WHY USE HIVE
BUZZ!
Cross-pollinate your existing SQL skills!
Makes Hadoop cross-correlations, joins, filters easier
Allows storage of intermediate results for faster/easier querying
Batch based processing
Individual queries still often slower than a relational database
E2E insight may be much faster
32
BI ON BIG DATA
Gain Insights
Mash-up Hive + other data in Excel
Hive data source to PowerPivot for in-memory analytics
Power View on top of PowerPivot for spectacular visualizations leading to insights
Securely share on SharePoint for collaboration, re-use, centralized data
Microsoft on top of Hadoop / Hive includes
PowerPivot
Power View
Analysis Services
PDW
StreamInsight
SQL Server
SQL Azure
Excel
33
BIG DEAL
NEXT STEPS
Get Involved
Read a bit
http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learningresources.aspx
Programming Hive Book
http://blogs.msdn.com/cindygross
Sign up: Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP)
Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP)
Think about how you can fit Big Data into your company data strategy
Suggest uses, be prepared to combat misuses
35
BIG DATA REFERENCES
Hadoop: The Definitive Guide by Tom White
SQL Server Sqoop http://bit.ly/rulsjX
JavaScript http://bit.ly/wdaTv6
Twitter https://twitter.com/#!/search/%23bigdata
Hive http://hive.apache.org
Excel to Hadoop via Hive ODBC http://tinyurl.com/7c4qjjj
Hadoop On Azure Videos http://tinyurl.com/6munnx2
Klout http://tinyurl.com/6qu9php
Microsoft Big Data http://microsoft.com/bigdata
Denny Lee http://dennyglee.com/category/bigdata/
Carl Nolan http://tinyurl.com/6wbfxy9
Cindy Gross http://tinyurl.com/SmallBitesBigData
MICROSOFT BIG DATA AT PASS SUMMIT
BIA-305-A SQLCAT: Big Data – All Abuzz About Hive
Wednesday 1015am | Cindy Gross, Dipti Sangani, Ed Katibah
BIA-204-M MAD About Data: Solve Problems and Develop a “Data Driven Mindset”
Wednesday 1015am | Darwin Schweitzer
AD-300-M Bootstrapping Data Warehousing in Azure for Use with Hadoop
Thursday 1015am | Steve Howard, James Podgorski, Olivier Matrat, Rafael Fernandez
BIA-306-M How Klout Changed the Landscape of Social Media with Hadoop and BI
Thursday 130pm | Denny Lee, Dave Mariani
AD-316-M Harnessing Big Data with Hadoop
Friday 8am | Mike Flasko
DBA-410-S Big Data Meets SQL Server
Friday 945am | David DeWitt
AD-315-M NoSQL and Big Data Programmability
Friday 415p | Michael Rys
37
Don’t Miss!
Win prizes with new
online evaluations
Build experience with
Hands On Labs
NEW: TCC 304
Attend David DeWitt’s
spotlight session Big
Data Meets SQL
Server
DBA-410-S, Room 6E
Friday, 9:45 AM
Be SQL Server 2012
Certified with onsite
testing
Find hidden session
announcements by
following:
Room 212-214
@sqlserver
#sqlpass
Visit the SQL Clinic
and new “I MADE
THAT!” Developer
Chalk talks
NEW: 4C-3 & 4C-4
PASS Resources
Free SQL Server and BI training
Free 1-day Training Events
Regional Event
Local and Virtual User Groups
Free Online Technical Training
This is Community
Learning Center
39
Thank you
for attending this session and
the 2012 PASS Summit in Seattle
November 6-9, Seattle, WA40
Please fill out evaluations!
SQLCAT: Big Data – All
Abuzz About Hive
Cindy Gross
SQLCAT BI/Big Data PM
Microsoft
http://blogs.msdn.com/cindygross
@SQLCindy
[email protected]
Dipti Sangani
SQL Big Data PM
Microsoft
[email protected]
Ed Katibah
SQLCAT Spatial PM
Microsoft
http://blogs.msdn.com/b/edkatibah/
@Spatial_Ed
[email protected]
November 6-9, Seattle, WA