Hadoop: An Industry Perspective
Amr Awadallah
Founder/CTO, Cloudera, Inc.
Massive Data Analytics over the Cloud (MDAC’2010)
Monday, April 26th, 2010
Outline
▪ What is Hadoop?
▪ Overview of HDFS and MapReduce
▪ How does Hadoop augment an RDBMS?
▪ Industry Business Needs:
  ▪ Data Consolidation (Structured or Not)
  ▪ Data Schema Agility (Evolve Schema Fast)
  ▪ Query Language Flexibility (Data Engineering)
  ▪ Data Economics (Store More for Longer)
▪ Conclusion
Amr Awadallah, Cloudera Inc
What is Hadoop?
▪ A scalable fault-tolerant distributed system for data storage and processing
▪ Its scalability comes from the marriage of:
  ▪ HDFS: Self-Healing High-Bandwidth Clustered Storage
  ▪ MapReduce: Fault-Tolerant Distributed Processing
▪ Operates on structured and complex data
▪ A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)
▪ Open source under the Apache License
▪ http://wiki.apache.org/hadoop/
Hadoop History
▪ 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
▪ 2003-2004: Google publishes GFS and MapReduce papers
▪ 2004: Cutting adds DFS & MapReduce support to Nutch
▪ 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
▪ 2007: NY Times converts 4TB of archives over 100 EC2s
▪ 2008: Web-scale deployments at Y!, Facebook, Last.fm
▪ April 2008: Yahoo does fastest sort of a TB, 3.5 mins over 910 nodes
▪ May 2009:
  ▪ Yahoo does fastest sort of a TB, 62 secs over 1460 nodes
  ▪ Yahoo sorts a PB in 16.25 hours over 3658 nodes
▪ June 2009, Oct 2009: Hadoop Summit, Hadoop World
▪ September 2009: Doug Cutting joins Cloudera
Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Shall Move to Data
4. Simple Core, Modular and Extensible
HDFS: Hadoop Distributed File System
Block Size = 64MB
Replication Factor = 3
Cost/GB is a few ¢/month vs $/month
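As a rough illustration of what the block size and replication factor above imply, the sketch below computes how many blocks a file splits into and how much raw disk it occupies across the cluster. The helper name `hdfs_footprint` and the 1 GB example file are assumptions made for illustration; HDFS does this bookkeeping internally.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block_replicas, raw_mb) for a file of the given size.

    Hypothetical helper for illustration only, not an HDFS API.
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION_FACTOR times on different nodes.
    block_replicas = blocks * REPLICATION_FACTOR
    # Raw disk consumed is the file size times the replication factor,
    # since a partial final block only stores its actual bytes.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, block_replicas, raw_mb


blocks, block_replicas, raw_mb = hdfs_footprint(1024)  # a 1 GB file
print(blocks, block_replicas, raw_mb)
```

Losing one node therefore leaves two other copies of each affected block, which is what lets the storage layer re-replicate and heal itself.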
MapReduce: Distributed Processing
Apache Hadoop Ecosystem
ETL Tools
BI Reporting
RDBMS
Pig (Data Flow)
Hive (SQL)
Sqoop
MapReduce (Job Scheduling/Execution System)
HBase (key-value store)
(Streaming/Pipes APIs)
HDFS (Hadoop Distributed File System)
Avro (Serialization)
ZooKeeper (Coordination)
Use The Right Tool For The Right Job
Relational Databases: When to use?
• Interactive Reporting (<1sec)
• Multistep Transactions
• Lots of Inserts/Updates/Deletes
Hadoop: When to use?
• Affordable Storage/Compute
• Structured or Not (Agility)
• Resilient Auto Scalability
Typical Hadoop Architecture
Business Users
End Customers
Business Intelligence
Interactive Application
OLAP Data Mart
OLTP Data Store
Engineers
Hadoop: Storage and Batch Processing
Data Collection
Complex Data is Growing Really Fast
Gartner – 2009
▪ Enterprise Data will grow 650% in the next 5 years.
▪ 80% of this data will be unstructured (complex) data
IDC – 2008
▪ 85% of all corporate information is in unstructured (complex) forms
▪ Growth of unstructured data (61.7% CAGR) will far outpace that of transactional data
Chart legend: Data types (Complex, Structured)
Data Consolidation: One Place For All
Complex Data
Documents
Web feeds
System logs
Online forums
SharePoint
Sensor data
EMB archives
Images/Video
Structured Data (“relational”)
Inventory
CRM
Financials
Sales records
HR records
Logistics
Data Marts
Web Profiles
A single data system to enable processing across the universe of data types.
Data Agility: Schema on Read vs Write
Schema-on-Write:
• Schema must be created before data is loaded.
• An explicit load operation has to take place which transforms the data to the internal structure of the database.
• New columns must be added explicitly before data for such columns can be loaded into the database.
• Read is Fast.
• Standards/Governance.
Schema-on-Read:
• Data is simply copied to the file store, no special transformation is needed.
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns.
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them.
• Load is Fast.
• Evolving Schemas/Agility.
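The schema-on-read behaviour described above can be sketched in a few lines. This is a minimal illustration, assuming JSON-encoded log lines; the record fields and the function names `serde_v1`, `serde_v2` and `read_table` are hypothetical and are not Hive's SerDe interface.

```python
import json

# Raw records are copied into the "file store" untouched: no load-time schema.
raw_lines = [
    '{"user": "alice", "url": "/home"}',
    '{"user": "bob", "url": "/cart", "referrer": "/home"}',  # new field shows up
]


def serde_v1(line):
    """Deserializer that only knows the original two columns."""
    record = json.loads(line)
    return {"user": record.get("user"), "url": record.get("url")}


def serde_v2(line):
    """Updated deserializer that also extracts the newer 'referrer' column."""
    record = json.loads(line)
    row = serde_v1(line)
    row["referrer"] = record.get("referrer")
    return row


def read_table(lines, serde):
    """Apply the deserializer at read time to every stored line."""
    return [serde(line) for line in lines]


rows_v1 = read_table(raw_lines, serde_v1)  # 'referrer' is stored but invisible
rows_v2 = read_table(raw_lines, serde_v2)  # same files, new column appears
print(rows_v2)
```

The stored lines never change between the two reads; only the deserializer does, which is why the new column shows up retroactively for data written before the update.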
Query Language Flexibility
▪ Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the “assembly language” of Hadoop).
▪ Streaming MapReduce: Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility.
▪ Pig: A relatively new language out of Yahoo, suitable for batch data flow workloads.
▪ Hive: A SQL interpreter on top of MapReduce, also includes a meta-store mapping files to their schemas and associated SerDes. Hive also supports User-Defined Functions and pluggable MapReduce streaming functions in any language.
Hive Extensible Data Types
▪ STRUCTS:
  ▪ SELECT mytable.mycolumn.myfield FROM …
▪ MAPS (Hashes):
  ▪ SELECT mytable.mycolumn[mykey] FROM …
▪ ARRAYS:
  ▪ SELECT mytable.mycolumn[5] FROM …
▪ JSON:
  ▪ SELECT get_json_object(mycolumn, objpath)
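For readers more used to a general-purpose language, the four access patterns above correspond roughly to nested-field, dictionary, list and JSON-path lookups. The sketch below is only an analogy in Python: the sample row, the column names and the helper `get_json_value` are assumptions, and the helper handles just simple dotted paths such as `$.store.name`, not the full path syntax Hive accepts.

```python
import json

# One hypothetical row with a struct, a map, an array and a JSON string column.
row = {
    "mystruct": {"myfield": "abc"},
    "mymap": {"mykey": 42},
    "myarray": [10, 11, 12, 13, 14, 15],
    "myjson": '{"store": {"name": "books"}}',
}


def get_json_value(json_text, objpath):
    """Resolve a simple dotted path like '$.store.name' against a JSON string.

    Returns None when any step of the path is missing.
    """
    value = json.loads(json_text)
    keys = [k for k in objpath.lstrip("$").split(".") if k]
    for key in keys:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


struct_field = row["mystruct"]["myfield"]                    # struct access
map_value = row["mymap"]["mykey"]                            # map access
array_item = row["myarray"][5]                               # array access
json_value = get_json_value(row["myjson"], "$.store.name")   # JSON path access
print(struct_field, map_value, array_item, json_value)
```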
Data Economics (Return On Byte)
• Return on Byte = value to be extracted from that byte / cost of storing that byte.
• If ROB is < 1 then it will be buried into tape wasteland, thus we need cheaper active storage.
High ROB
Low ROB
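The Return on Byte rule can be written down directly. The sketch below is a minimal illustration; the function names and the value and cost figures are made-up assumptions in arbitrary units, chosen only to show how cheaper storage flips the decision.

```python
def return_on_byte(value_per_byte, cost_per_byte):
    """ROB = value to be extracted from a byte / cost of storing that byte."""
    return value_per_byte / cost_per_byte


def storage_tier(value_per_byte, cost_per_byte):
    """Data with ROB < 1 is not worth keeping online and goes to tape."""
    if return_on_byte(value_per_byte, cost_per_byte) < 1:
        return "tape archive"
    return "active storage"


# Same data, same value, two hypothetical storage costs (arbitrary units).
expensive = storage_tier(value_per_byte=1.0, cost_per_byte=4.0)  # ROB = 0.25
cheap = storage_tier(value_per_byte=1.0, cost_per_byte=0.5)      # ROB = 2.0
print(expensive, cheap)
```

Lowering the cost of storage raises the ROB of data whose value has not changed, which is the argument for cheaper active storage.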
Case Studies: Hadoop World ’09
▪ VISA: Large Scale Transaction Analysis
▪ JP Morgan Chase: Data Processing for Financial Services
▪ China Mobile: Data Mining Platform for Telecom Industry
▪ Rackspace: Cross Data Center Log Processing
▪ Booz Allen Hamilton: Protein Alignment using Hadoop
▪ eHarmony: Matchmaking in the Hadoop Cloud
▪ General Sentiment: Understanding Natural Language
▪ Yahoo!: Social Graph Analysis
▪ Visible Technologies: Real-Time Business Intelligence
▪ Facebook: Rethinking the Data Warehouse with Hadoop and Hive
Slides and Videos at http://www.cloudera.com/hadoop-world-nyc
Cloudera Desktop for Hadoop
Conclusion
Hadoop is a scalable distributed data processing system which enables:
1. Consolidation (Structured or Not)
2. Data Agility (Evolving Schemas)
3. Query Flexibility (Any Language)
4. Economical Storage (ROB > 1)
Contact Information
Amr Awadallah
CTO, Cloudera Inc.
[email protected]
http://twitter.com/awadallah
Online Training Videos and Info:
http://cloudera.com/hadoop-training
http://cloudera.com/blog
http://twitter.com/cloudera
(c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.
MapReduce: The Programming Model
SELECT word, COUNT(1) FROM docs GROUP BY word;
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
Input example: “To Be Or Not To Be?”
Split 1 … Split i … Split N: (docid, text)
Map 1 … Map i … Map M: (words, counts), e.g. Be, 5; Be, 12; Be, 7; Be, 6
Shuffle: (sorted words, counts)
Reduce 1 … Reduce i … Reduce R: (sorted words, sum of counts), e.g. Be, 30
Output File 1 … Output File i … Output File R
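The map, shuffle and reduce stages of the word count can be imitated in a single process. The sketch below is not the Hadoop API: it runs on one machine, the function names are hypothetical, and the second document is an invented example. It only shows which work belongs to which stage.

```python
from collections import defaultdict


def map_phase(docid, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        word = word.strip('?.,!"“”')
        if word:
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group the counts by word and return groups sorted by word."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return sorted(groups.items())


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


# Each document stands in for one input split, as (docid, text).
docs = {"doc1": "To Be Or Not To Be?", "doc2": "To be is to do"}

mapped = [pair for docid, text in docs.items() for pair in map_phase(docid, text)]
result = dict(reduce_phase(word, counts) for word, counts in shuffle_phase(mapped))
print(result)
```

In a real cluster each map call runs near the split it reads and each reducer receives only its own partition of the sorted keys; the grouping step here plays the role of the shuffle.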
Hadoop High-Level Architecture
Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
Name Node: Maintains mapping of file blocks to data node slaves
Job Tracker: Schedules jobs across task tracker slaves
Data Node: Stores and serves blocks of data
Task Tracker: Runs tasks (work units) within a job
Data Node and Task Tracker: Share Physical Node
Economics of Hadoop Storage
▪ Typical Hardware:
  ▪ Two Quad Core Nehalems
  ▪ 24GB RAM
  ▪ 12 * 1TB SATA disks (JBOD mode, no need for RAID)
  ▪ 1 Gigabit Ethernet card
  ▪ Cost/node: $5K/node
▪ Effective HDFS Space:
  ▪ ¼ reserved for temp shuffle space, which leaves 9TB/node
  ▪ 3-way replication leads to 3TB effective HDFS space/node
  ▪ But assuming 7x compression that becomes ~ 20TB/node
Effective Cost per user TB: $250/TB
Other solutions cost in the range of $5K to $100K per user TB
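The arithmetic behind the per-TB figure can be reproduced step by step. The sketch below uses only the numbers quoted on the slide; the function name `effective_tb_per_node` is a hypothetical helper. Note that 7x compression on 3 TB gives 21 TB exactly, which the slide rounds to roughly 20 TB to arrive at $250/TB.

```python
def effective_tb_per_node(disks=12, disk_tb=1.0, shuffle_reserve=0.25,
                          replication=3, compression=7):
    """Walk through the slide's capacity math for one node."""
    raw_tb = disks * disk_tb                          # 12 TB of raw disk
    after_shuffle = raw_tb * (1 - shuffle_reserve)    # 1/4 kept for temp shuffle
    after_replication = after_shuffle / replication   # 3 copies of every block
    user_tb = after_replication * compression         # assumed compression ratio
    return after_shuffle, after_replication, user_tb


NODE_COST_USD = 5000  # cost per node quoted on the slide

after_shuffle, after_replication, user_tb = effective_tb_per_node()
exact_cost_per_tb = NODE_COST_USD / user_tb   # using the unrounded 21 TB
rounded_cost_per_tb = NODE_COST_USD / 20      # using the slide's ~20 TB
print(after_shuffle, after_replication, user_tb)
print(round(exact_cost_per_tb), rounded_cost_per_tb)
```

The compression ratio dominates the result: without it, the same node yields 3 TB of user space, so the 7x figure is an assumption worth checking against the actual data.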
Data Engineering vs Business Intelligence
▪ Business Intelligence:
  ▪ The practice of extracting business numbers to monitor and evaluate the health of the business.
  ▪ Humans make decisions based on these numbers to improve revenues or reduce costs.
▪ Data Engineering:
  ▪ The science of writing algorithms that convert data into money. Alternatively, how to automatically transform data into new features that increase revenues or reduce costs.