SOFTWARE SYSTEMS DEVELOPMENT: MAP-REDUCE, Hadoop, HBase
The problem
Batch (offline) processing of huge data sets using commodity hardware
Linear scalability
Need infrastructure that handles all the mechanics and lets the developer focus on the processing logic/algorithms
Data Sets
The New York Stock Exchange: 1 terabyte of data per day
Facebook: 100 billion photos, 1 petabyte (1,000 terabytes)
Internet Archive: 2 petabytes of data, growing by 20 terabytes per month
The data cannot fit on a single node; a distributed file system is needed to hold it
Batch processing
Single write/append, multiple reads
Example: analyze log files for the most frequent URL
Each data entry is self-contained
At each step, each data entry can be treated individually
After the aggregation, each aggregated data set can be treated individually
Grid Computing
Grid computing: a cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network)
Works well for computation-intensive tasks; struggles with huge data sets because the network becomes a bottleneck
Programming paradigm: low-level Message Passing Interface (MPI)
Hadoop
Open-source implementation of 2 key ideas
HDFS: Hadoop Distributed File System
Map-Reduce: programming model
Built on ideas from Google's infrastructure (GFS and MapReduce papers published 2003/2004)
Java/Python/C interfaces; several projects are built on top of it
Approach
A limited but simple model that fits a broad range of applications
Handle communication, redundancy, and scheduling in the infrastructure
Move computation to the data instead of moving data to the computation
Who is using Hadoop?
Distributed File System (HDFS)
Files are split into large blocks (128 MB, 64 MB)
Compare with a typical FS block of 512 bytes
Blocks are replicated among Data Nodes (DN)
3 copies by default
Name Node (NN) keeps track of files and their pieces
Single master node
Stream-based I/O
Sequential access
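As a rough illustration of the block and replication figures above, the following Python sketch estimates how many blocks a file occupies and how much raw storage its replicas consume. The helper name `hdfs_footprint` and the 1 GB example file are assumptions made for illustration; only the 128 MB / 64 MB block sizes and the 3-copy default come from the slide.

```python
import math

MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across all replicas).

    Hypothetical helper: assumes a partially filled last block only
    occupies as much space as the data it holds.
    """
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes

# A 1 GB file with the default 128 MB block size and 3 replicas
blocks, raw = hdfs_footprint(1024 * MB)
print(blocks, raw // MB)
```

With 64 MB blocks the same file would be split into twice as many blocks, while the raw storage stays the same because it depends only on the replication factor.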
HDFS: File Read
HDFS: File Write
HDFS: Data Node Distance
Map Reduce
A Programming Model
Decompose a processing job into Map and Reduce stages
The developer needs to provide code for the Map and Reduce functions, configure the job, and let Hadoop handle the rest
Map-Reduce Model
MAP function
Map each data entry into a <key, value> pair
Examples
Map each log file entry into <URL, 1>
Map each daily stock trading record into <STOCK, Price>
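A minimal Python sketch of the log-file example above, written as a plain function rather than against the Hadoop API. The log format (space-separated, URL in the second field) and the name `map_log_entry` are assumptions for illustration.

```python
def map_log_entry(line):
    """Map one log line to a (URL, 1) pair.

    Assumes a hypothetical space-separated log format whose second
    field is the requested URL.
    """
    fields = line.split()
    return (fields[1], 1)

# Made-up sample input, one entry per line
log_lines = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /about.html 200",
    "10.0.0.3 /index.html 404",
]
pairs = [map_log_entry(line) for line in log_lines]
print(pairs)
```

Each entry is mapped independently of the others, which is what allows the Map stage to run in parallel on many nodes.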
Hadoop: Shuffle/Merge phase
Hadoop merges (shuffles) the output of the MAP stage into <key, value1, value2, value3>
Examples
<URL, 1, 1, 1, 1, 1, 1>
<STOCK, Price on day 1, Price on day 2, ...>
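The grouping step can be sketched in a few lines of Python. This is an in-memory stand-in, not Hadoop's distributed implementation; the function name `shuffle` and the sample pairs are assumptions.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key and return them in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Sorting by key mimics the ordered input each Reduce call receives
    return dict(sorted(grouped.items()))

map_output = [("/index.html", 1), ("/about.html", 1), ("/index.html", 1)]
print(shuffle(map_output))
```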
Reduce function
Reduce the entries produced by Hadoop's merge processing into a <key, value> pair
Examples
Reduce <URL, 1, 1, 1> into <URL, 3>
Reduce <Stock, 3, 2, 10> into <Stock, 10>
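The two examples above correspond to a sum and a maximum. A small Python sketch, with the hypothetical names `reduce_count` and `reduce_max`:

```python
def reduce_count(key, values):
    """Collapse a list of 1s into a total count for the key."""
    return (key, sum(values))

def reduce_max(key, values):
    """Keep only the largest value seen for the key."""
    return (key, max(values))

print(reduce_count("URL", [1, 1, 1]))
print(reduce_max("Stock", [3, 2, 10]))
```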
Map-Reduce Flow
Hadoop Infrastructure
Replicate/distribute data among the nodes
Input
Output
Map/Shuffle output
Schedule processing
Partition data
Assign processing nodes (PN)
Move code to PN (e.g. send Map/Reduce code)
Manage failures (block CRC, rerun Map/Reduce if necessary)
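To illustrate the idea of checking a block CRC and falling back to another replica, here is a simplified Python sketch using the standard `zlib.crc32` function. It does not reproduce how HDFS stores or verifies checksums; `read_block`, the replica list, and the sample data are assumptions.

```python
import zlib

def read_block(replicas, expected_crc):
    """Return the first replica whose CRC32 matches the recorded checksum.

    Hypothetical helper: `replicas` is a list of byte strings standing
    in for copies of the same block held on different data nodes.
    """
    for data in replicas:
        if zlib.crc32(data) == expected_crc:
            return data
    raise IOError("all replicas failed the checksum")

original = b"AAA,10.0,12.5\n"
recorded_crc = zlib.crc32(original)   # stored when the block is written
corrupted = b"AAA,10.0,92.5\n"        # one byte damaged on disk

# The damaged copy is skipped and the healthy replica is returned
good = read_block([corrupted, original], recorded_crc)
print(good == original)
```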
Example: Trading Data Processing
Input: historical stock data
Records are stored in CSV (comma-separated values) text files
Each line: stock_symbol, low_price, high_price
1987-2009 data for all stocks, one record per stock per day
Output: maximum interday delta for each stock
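The job described above can be sketched end to end in plain Python, without Hadoop. This is not the code from the slides that follow; it assumes the delta of a record is high_price minus low_price, and the function names and sample rows are made up for illustration.

```python
import csv
from collections import defaultdict
from io import StringIO

def map_record(row):
    """Map one CSV row to (stock_symbol, high_price - low_price)."""
    symbol, low, high = row
    return (symbol.strip(), float(high) - float(low))

def reduce_max_delta(symbol, deltas):
    """Reduce all deltas of one stock to the largest one."""
    return (symbol, max(deltas))

def run_job(csv_text):
    """Run map, an in-memory shuffle, and reduce over the CSV text."""
    grouped = defaultdict(list)
    for row in csv.reader(StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        symbol, delta = map_record(row)
        grouped[symbol].append(delta)
    return dict(reduce_max_delta(s, d) for s, d in sorted(grouped.items()))

# Made-up records in the stock_symbol, low_price, high_price layout
sample = """AAA,10.0,12.5
BBB,20.0,21.0
AAA,11.0,15.0
"""
result = run_job(sample)
print(result)  # {'AAA': 4.0, 'BBB': 1.0}
```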
Map Function: Part I
Map Function: Part II
Reduce Function
Running the Job : Part I
Running the Job: Part II
Inside Hadoop
Datastore: HBASE
Distributed Column-Oriented database on top of HDFS
Modeled after Google’s BigTable data store
Random reads/writes on top of the sequential, stream-oriented HDFS
Billions of rows * millions of columns * thousands of versions
HBASE: Logical View
Row Key          Time Stamp   Column: Contents   Column Family: Anchor (referred by/to)   Column: “mime”
“com.cnn.www”    T9           -                  cnnsi.com: cnn.com/1                      -
                 T8           -                  my.look.ca: cnn.com/2                     -
                 T6           “<html>..”         -                                         text/html
                 T5           “<html>..”         -                                         -
                 T3           “<html>..”         -                                         -
Physical View
Row Key          Time Stamp   Column: Contents
com.cnn.www      T6           “<html>..”
com.cnn.www      T5           “<html>..”
com.cnn.www      T3           “<html>..”

Row Key          Time Stamp   Column Family: Anchor
com.cnn.www      T9           cnnsi.com: cnn.com/1
com.cnn.www      T8           my.look.ca: cnn.com/2

Row Key          Time Stamp   Column: mime
com.cnn.www      T6           text/html
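The relationship between the logical and physical views can be mimicked with nested dictionaries. The sketch below is a toy in-memory model, not the HBase client API; the class `ToyTable`, its methods, and the integer timestamps standing in for T3..T9 are assumptions.

```python
class ToyTable:
    """Toy model: row key -> column -> {timestamp: value}.

    Empty cells are simply absent, as in the physical view.
    """

    def __init__(self):
        self.rows = {}

    def put(self, row, column, timestamp, value):
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self.rows.get(row, {}).get(column, {})
        visible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(visible)] if visible else None

    def family_view(self, family):
        """Physical-style view: only the cells of one column family, newest first."""
        cells = []
        for row, columns in self.rows.items():
            for column, versions in columns.items():
                if column.split(":", 1)[0] == family:
                    for ts, value in versions.items():
                        cells.append((row, ts, column, value))
        return sorted(cells, key=lambda c: (c[0], -c[1], c[2]))

table = ToyTable()
for ts in (3, 5, 6):
    table.put("com.cnn.www", "contents:", ts, "<html> v%d" % ts)
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "cnn.com/1")
table.put("com.cnn.www", "mime:", 6, "text/html")

print(table.get("com.cnn.www", "contents:"))     # newest version
print(table.get("com.cnn.www", "contents:", 4))  # version visible at time 4
print(table.family_view("anchor"))
```

Columns are named `family:qualifier`, so grouping cells by the part before the colon yields one table per column family, which matches the per-family layout shown above.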
HBASE: Region Servers
Tables are split into horizontal regions
Each region comprises a subset of rows
HDFS: NameNode, DataNode
MapReduce: JobTracker, TaskTracker
HBASE: Master Server, Region Server
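Because rows are kept in key order, finding the region that holds a row reduces to a search over the sorted region start keys. The Python sketch below shows the idea with `bisect`; the start keys, server names, and the function `find_region` are made up for illustration and do not reflect how HBase stores its region catalog.

```python
import bisect

# Hypothetical regions: region i holds row keys from region_start_keys[i]
# up to (but not including) the next start key.
region_start_keys = ["", "g", "n", "t"]
region_servers = ["rs1", "rs2", "rs1", "rs3"]

def find_region(row_key):
    """Return (region index, region server) responsible for row_key."""
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return idx, region_servers[idx]

print(find_region("com.cnn.www"))
print(find_region("org.apache.hbase"))
```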
HBASE Architecture
HBASE vs RDBMS
HBase tables are similar to RDBMS tables, with some differences
Rows are sorted by row key
Only cells are versioned
Columns can be added on the fly by the client, as long as the column family they belong to already exists
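The last two points can be illustrated with a toy Python class: column families are fixed when the table is created, new columns inside a family are accepted at write time, and scans return rows in key order. `ToySchemaTable` and its methods are hypothetical and are not the HBase client API.

```python
class ToySchemaTable:
    """Toy table with fixed column families and free-form columns."""

    def __init__(self, families):
        self.families = set(families)  # declared up front
        self.rows = {}

    def put(self, row, column, value):
        """Store a value; the column's family must already exist."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        self.rows.setdefault(row, {})[column] = value

    def scan(self):
        """Return (row key, columns) pairs sorted by row key."""
        return [(row, self.rows[row]) for row in sorted(self.rows)]

t = ToySchemaTable(["anchor", "contents"])
t.put("org.example", "contents:", "<html>..")
t.put("com.cnn.www", "anchor:cnnsi.com", "cnn.com/1")   # new column, existing family
t.put("com.cnn.www", "anchor:my.look.ca", "cnn.com/2")  # another new column

try:
    t.put("com.cnn.www", "mime:", "text/html")          # family was never declared
except KeyError as err:
    print("rejected:", err)

print([row for row, _ in t.scan()])
```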