The Elephant In The Room
A DBA’s Guide To Hadoop
Stuart R. Ainsworth
Session Evaluations
Your feedback is important and valuable.
Submit by 11:59 PM EST Friday Nov. 7 to WIN prizes
Ways to access:
● Go to passsummit.com/evals
● Download the GuideBook App and search: PASS Summit 2014
● Follow the QR code link displayed on session signage throughout the conference venue and in the program guide
Evaluation Deadline: 11:59 PM EST, Sunday Nov. 16
Purpose
Rosetta Stone presentation
High level overview of Hadoop & Big Data
NOT a deep dive
NOT a demo session
Mostly theory & vocabulary
Where to learn more
Caveats
Focus is vendor-specific
● Hortonworks Hadoop
● Microsoft SQL Server
Don’t consider myself a Hadoop expert (yet)
About Me
Manage DBAs for a financial services company
Former Data Architect, DBA, developer
Linchpin People TeamMate
AtlantaMDF Chapter Leader
Infrequent blogger: http://codegumbo.com
About You
Assume that you have:
● SQL experience
● exposure to database admin & architecture
● little to no experience with Big Data
Challenges...
...for the SQL Server DBA
Rapid Evolution
SQL Server new version => 2-4 years
New functionality; deprecations
Hadoop “official” release => 6 months
New functionality; deprecations
Different components on separate cycles
DEVELOPERS
DBAS
Ecosystems, not product
Open-source; vendors add enhancements
Official Hadoop is only four modules:
● HDFS
● Hadoop MapReduce
● Hadoop YARN
● Hadoop Common
Hadoop Ecosystem (Hortonworks)
“Big Data”
Big Data is like teenage sex...
Everyone talks about it,
Nobody really knows how to do it,
Everyone thinks everyone else is doing it,
So everyone claims they are doing it…
-Dan Ariely
The 3, 4, 5 V’s of Big Data
Volume - data is too big to scale up
Velocity - decision window is small
Variety - multiple formats challenge integration
Veracity (Variability) - same data, different interpretations
Value - cost/benefit of data collection & retention
RDBMS versus Big Data
RDBMS
● Primarily Scale-Up
● Strong Typing
● Normalization
● Default Mutable
● Mature
Big Data
● Primarily Scale-Out
● Schemaless
● Default Immutable
● Evolving
Foundations
“Gentlemen, this is a football…”
- Vince Lombardi
Hadoop
Scalable, distributed processing framework
Official Hadoop is only four modules:
● HDFS
● Hadoop MapReduce
● Hadoop YARN
● Hadoop Common
HDFS
Hadoop Distributed File System
Inspired by the Google File System (2002-2003)
Cluster storage of large files across servers
Yahoo - 10,000 core Hadoop cluster(s)
Facebook - 100 PB+ (June, 2012)
http://goo.gl/SpSN
HDFS
File permissions and authentication.
Rack aware
fsck: find missing files or blocks.
Scheduled Rebalancing
Redundancy & Replication
Built around MapReduce
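To make "Rack aware" and "Redundancy & Replication" concrete, here is a toy Python sketch of splitting a file into blocks and spreading copies across racks. It is not how the NameNode is implemented. The cluster layout, node names, and function names are all made up for illustration; 128 MB blocks and 3 replicas are common defaults but are configurable.

```python
# Illustrative sketch only: a toy model of HDFS block splitting and
# rack-aware replica placement. Cluster layout and names are hypothetical.
import math

BLOCK_SIZE_MB = 128   # common default block size (configurable)
REPLICATION = 3       # common default replication factor (configurable)

# Hypothetical cluster: rack name -> data nodes on that rack
CLUSTER = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
}


def split_into_blocks(file_size_mb):
    """Return how many blocks are needed to hold a file of this size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def place_replicas(block_id, cluster=CLUSTER):
    """Choose nodes for one block: one copy on one rack, the remaining
    copies on a different rack, so losing a rack never loses the block.
    This is a simplification of the real placement policy."""
    racks = sorted(cluster)
    first_rack = racks[block_id % len(racks)]
    other_rack = racks[(block_id + 1) % len(racks)]
    first = cluster[first_rack][block_id % len(cluster[first_rack])]
    others = cluster[other_rack][:REPLICATION - 1]
    return [first] + others


def rack_of(node, cluster=CLUSTER):
    """Look up which rack a node lives on."""
    for rack, nodes in cluster.items():
        if node in nodes:
            return rack
    raise KeyError(node)


if __name__ == "__main__":
    # A 300 MB file needs three 128 MB blocks (the last one partly filled)
    for block_id in range(split_into_blocks(300)):
        print(block_id, place_replicas(block_id))
```

Every block ends up on two racks, which is the point of rack awareness: a failed switch or rack still leaves a readable copy.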
Hadoop MapReduce
“Developed” by Google; paper published and patent filed in 2004
Map - filtering and sorting
Reduce - summarization
Inherently distributed
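A minimal single-process Python sketch of the Map and Reduce roles described above, assuming made-up (symbol, volume) records. The function names are hypothetical and this is not the Hadoop API; on a cluster the map and reduce calls would run on many nodes, with the framework doing the grouping in between.

```python
# Illustrative sketch only: map -> shuffle -> reduce in one process.
from collections import defaultdict

# Made-up sample records (symbol, volume); not real market data
RECORDS = [("IBM", 100), ("MSFT", 250), ("IBM", 50), ("MSFT", 25), ("ORCL", 10)]


def map_phase(record):
    """Map: filter each record and emit (key, value) pairs."""
    symbol, volume = record
    if volume > 0:
        yield symbol, volume


def shuffle(pairs):
    """Group values by key and sort the keys; the framework does this
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: summarize all values that share a key."""
    return key, sum(values)


def run_job(records):
    """Run the three steps in order and return {key: summary}."""
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(RECORDS))
```

Because each map call sees one record and each reduce call sees one key, the work can be split across machines without the functions knowing about each other.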
Hadoop YARN
Yet Another Resource Negotiator
Splits resource management out of MapReduce
Allows for the use of other processing types (e.g., graph,
stream, etc).
Hadoop Common
Shared libraries for Hadoop components & enhancements
Security objects are best example
● Superusers
● Service Level Authorization
● HTTP Authentication
But Wait… There’s More!
Sqoop
Data connector between RDBMS and HDFS
Command line interface
JDBC driver; BCP-like syntax
Tutorial
Hive
HiveQL - SQL-like syntax
DDL scripts define tables
Query transformed into MapReduce jobs
Performance increases as the cluster scales out
Stinger initiative - Microsoft/Hortonworks
Hive
create external table price_data (stock_exchange string,
symbol string, trade_date string, open float, high float,
low float, close float, volume int, adj_close float) row
format delimited fields terminated by ',' stored as
textfile location '/user/hue/nyse/nyse_prices';
select * from price_data where symbol = 'IBM';
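The external table above only describes files that already sit in HDFS; the schema is applied when the data is read. A rough Python sketch of that idea follows. The column names and types come from the DDL, but the two sample rows are made up and the function names are hypothetical. Hive itself would run this as a distributed job rather than a loop.

```python
# Illustrative sketch only: applying a schema to delimited text at read time.
import csv
import io

# Column names and types taken from the price_data DDL
SCHEMA = [
    ("stock_exchange", str), ("symbol", str), ("trade_date", str),
    ("open", float), ("high", float), ("low", float),
    ("close", float), ("volume", int), ("adj_close", float),
]

# Made-up rows standing in for the text files in the table's location
RAW_TEXT = (
    "NYSE,IBM,2014-01-02,10.0,11.0,9.5,10.5,1000,10.5\n"
    "NYSE,XYZ,2014-01-02,20.0,21.0,19.5,20.5,2000,20.5\n"
)


def read_rows(text, schema=SCHEMA):
    """Parse comma-delimited lines and cast each field per the schema."""
    for fields in csv.reader(io.StringIO(text)):
        if fields:  # skip blank lines
            yield {name: cast(value)
                   for (name, cast), value in zip(schema, fields)}


def select_where_symbol(text, symbol):
    """Rough equivalent of: select * from price_data where symbol = ..."""
    return [row for row in read_rows(text) if row["symbol"] == symbol]


if __name__ == "__main__":
    print(select_where_symbol(RAW_TEXT, "IBM"))
```

The files stay plain text; dropping the table definition would not delete them, and a different definition could read the same files another way.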
HCatalog
Tight integration with Hive, but supports all Hadoop data
access protocols
Define relational view into data (DDL)
“Tables” can be reused by Hive, Pig, Storm...
Tutorial
Pig
Data abstraction language; Yahoo (2006)
Based on Java; supports Python & Ruby
Procedural (SQL is declarative)
Allows for ETL
Lazy evaluation
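"Lazy evaluation" means Pig statements only build up a plan; nothing is read or computed until a statement such as STORE or DUMP asks for output. A toy Python sketch of that behavior, with a hypothetical Relation class and a made-up loader (neither is part of Pig):

```python
# Illustrative sketch only: steps are recorded, then run on store().
LOAD_CALLS = []  # records when the source is actually read


def load_lines():
    """Made-up loader standing in for reading a file from HDFS."""
    LOAD_CALLS.append("load")
    return [" a ", "b", ""]


class Relation:
    """Toy relation that remembers its steps and runs them only on store()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def foreach(self, fn):
        """Add a per-row transformation to the plan; runs nothing."""
        return Relation(self._source, self._steps + (("foreach", fn),))

    def filter(self, predicate):
        """Add a row filter to the plan; runs nothing."""
        return Relation(self._source, self._steps + (("filter", predicate),))

    def store(self):
        """Read the source and execute every recorded step in order."""
        rows = list(self._source())
        for kind, fn in self._steps:
            if kind == "foreach":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows


if __name__ == "__main__":
    cleaned = Relation(load_lines).foreach(str.strip).filter(bool)
    print(LOAD_CALLS)       # still empty: only a plan exists so far
    print(cleaned.store())  # the source is read here
```

Deferring the work lets the engine see the whole script before running it, which is what allows several statements to be combined into fewer jobs.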
Pig
ETL service; useful as “duct tape”
Typical scenario:
Load data into HDFS
Use Pig to scrub data, and
Pump to another “db” (e.g., MongoDB)
Web service reads from destination
But Wait… There’s Too Much!
Hortonworks Hadoop ≈ SQL Server
● HDFS ≈ Database, Windows Cluster
● MapReduce, YARN, Hadoop Common ≈ Relational Engine/Optimizer
● Master Web Interface ≈ SQL Server Management Studio
● Sqoop ≈ BCP
● Hive ≈ SQL
● HCatalog ≈ Views
● Pig ≈ Powershell, SSIS
Administration
“Average DBA” Functionality
What does the Average DBA do?
Backup & Recovery
Performance Monitoring
Data Stewardship
Backups
NULL
Performance Monitoring
Ambari
• Provisions
• Manages
• Monitors
Allows for component
control across cluster
[Charts: performance versus application growth for RDBMS and for Big Data]
Scale-Up Costs (SQL Server)
Single Server
• Maximum RAM
• SAN
Licenses
• Windows
• SQL Server
• Microsoft Support
Personnel
• Developers
• DBA
• SAN Admin
• Network Admin
Facilities
• Minimum Footprint
Scale-Out Costs (Hortonworks HDP)
Multiple Servers
• Commodity
Licenses
• Windows ($$$)
• Linux (0\Support $)
• HDP Support
Personnel
• Developer
• HDP Admin
• Network Admin
Facilities
• Power
• Space
• Air
Performance Tuning
[Figure: RDBMS versus Hadoop, each split into system and code]
Performance Tuning Tips
Performance Architecture
Nathan Marz - Twitter, Storm
Lambda Architecture
Data Stewardship
“How many customers have product X?”
“What’s the data footprint for this customer?”
“How many times does this word appear in the same
sentence as this word?”
Word Count
Problem: count the number of times a word appears in a
specific record.
e.g. “Lorem ipsum dolor sit amet, consectetur adipiscing
elit.”...
Word Count
SQL Server
Create UDF to parse
strings
Hadoop
Pig script to parse
strings
Word Count - SQL Server
CREATE FUNCTION WordRepeatedNumTimes
(@SourceString varchar(max), @TargetWord varchar(8000))
RETURNS int
AS
BEGIN
    DECLARE @NumTimesRepeated int
        ,@CurrentStringPosition int
        ,@LengthOfString int
        ,@PatternStartsAtPosition int
        ,@LengthOfTargetWord int
        ,@NewSourceString varchar(max)

    SET @LengthOfTargetWord = len(@TargetWord)
    SET @LengthOfString = len(@SourceString)
    SET @NumTimesRepeated = 0
    SET @CurrentStringPosition = 0
    SET @PatternStartsAtPosition = 0
    SET @NewSourceString = @SourceString

    -- Guard: with an empty target the loop below would never end
    IF @LengthOfTargetWord = 0 RETURN 0

    -- Find the next match, count it, then drop everything through the match
    WHILE len(@NewSourceString) >= @LengthOfTargetWord
    BEGIN
        SET @PatternStartsAtPosition = CHARINDEX(@TargetWord, @NewSourceString)
        IF @PatternStartsAtPosition <> 0
        BEGIN
            SET @NumTimesRepeated = @NumTimesRepeated + 1
            SET @CurrentStringPosition = @CurrentStringPosition +
                @PatternStartsAtPosition + @LengthOfTargetWord
            SET @NewSourceString = substring(@NewSourceString,
                @PatternStartsAtPosition + @LengthOfTargetWord, @LengthOfString)
        END
        ELSE
        BEGIN
            -- No more matches: empty the string to exit the loop
            SET @NewSourceString = ''
        END
    END
    RETURN @NumTimesRepeated
END
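For readers who find the loop easier to follow outside T-SQL, here is a Python sketch of the same scan. The function name is hypothetical. It counts non-overlapping occurrences of a substring, so it matches inside longer words as well. Python matching here is case-sensitive, while CHARINDEX follows the collation in effect.

```python
# Illustrative sketch only: a Python rendering of the scan-and-skip loop.
def word_repeated_num_times(source, target):
    """Count non-overlapping occurrences of target in source, scanning
    left to right and resuming right after each match."""
    if not target:
        return 0  # nothing to search for
    count = 0
    position = source.find(target)
    while position != -1:
        count += 1
        # Resume just past the match, like the substring() call in T-SQL
        position = source.find(target, position + len(target))
    return count


if __name__ == "__main__":
    text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
    print(word_repeated_num_times(text, "it"))
```

Note that "it" is found inside "sit" and "elit": the function counts substrings, not whole words, which is one reason the tokenizing approach in the Pig version gives different answers.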
Word Count (Hadoop)
a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';
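A step-by-step Python mirror of the Pig script, as a sketch on a single machine. The function names are hypothetical, and the tokenizer only approximates Pig's TOKENIZE by splitting on whitespace and commas.

```python
# Illustrative sketch only: the Pig word count steps in plain Python.
from collections import Counter

# Sample sentence from the problem statement
TEXT = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."


def tokenize(line):
    """Approximate TOKENIZE: split on whitespace, treating commas as spaces."""
    return line.replace(",", " ").split()


def word_count(lines):
    """b: flatten the tokens of every line; c and d: group and count."""
    words = [word for line in lines for word in tokenize(line)]
    return Counter(words)


if __name__ == "__main__":
    for word, count in word_count([TEXT]).items():
        print(count, word)
```

Unlike the SQL Server function, this counts every distinct word in one pass instead of searching for one target at a time.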
Getting Started
Getting Started (Coding)
1. Lab Environment (Virtualized)
2. Install Hortonworks Sandbox
1. Setup Azure account
2. HDInsight
Theoretically, can scale to PB, but no idea what that will cost you.
Note that the interface highlights Hive (with Stinger); Pig commands are run through Powershell.
HDFS commands are mostly Linux commands.
Getting Started (Multi-Node Admin)
1. Lab Environment (Virtualized)
2. Setup OS (Windows or Linux)
3. Download (MSI or RPM)
4. Deploy Prereqs (Python, Java, C++)
5. Setup Master Node(s)
6. Setup Data Node(s)
Windows Installation Tutorial
Linux Installation Tutorial
In Conclusion
Lots of vocabulary
HDFS, Pig, Hive, MapReduce
Map to SQL Server (RDBMS) vocabulary
Challenges to implementation
Administration tools & scalability questions
Where to get started
Questions?
Contact Me
Stuart R. Ainsworth
Twitter: @codegumbo
Email: [email protected]
Big Data - Dangerous
http://www.thefacehawk.com/
