SOFTWARE SYSTEMS DEVELOPMENT: MAP-REDUCE, Hadoop, HBase

The problem

- Batch (offline) processing of huge data sets using commodity hardware
- Linear scalability
- Need infrastructure that handles all the mechanics, allowing developers to focus on the processing logic/algorithms

Data Sets

- The New York Stock Exchange: 1 terabyte of data per day
- Facebook: 100 billion photos, about 1 petabyte (1000 terabytes)
- Internet Archive: 2 petabytes of data, growing by 20 terabytes per month
- Data this large can't sit on a single node; a distributed file system is needed to hold it

Batch processing

- Single write/append, multiple reads
- Example: analyze log files for the most frequent URL
  - Each data entry is self-contained
  - At each step, each data entry can be treated individually
  - After the aggregation, each aggregated data set can be treated individually

Grid Computing

- Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network)
- Works well for computation-intensive tasks; huge data sets are a problem, as the network becomes a bottleneck
- Programming paradigm: low-level Message Passing Interface (MPI)

Hadoop

- Open-source implementation of two key ideas
  - HDFS: Hadoop Distributed File System
  - Map-Reduce: a programming model
- Built on the design of Google's infrastructure (the GFS and MapReduce papers, published 2003/2004)
- Java/Python/C interfaces; several projects built on top of it

Approach

- A limited but simple model that fits a broad range of applications
- Communication, redundancy, and scheduling are handled by the infrastructure
- Move computation to the data instead of moving data to the computation

Who is using Hadoop?
Distributed File System (HDFS)

- Files are split into large blocks (128 MB or 64 MB)
  - Compare with a typical FS block of 512 bytes
- Blocks are replicated among Data Nodes (DN)
  - 3 copies by default
- The Name Node (NN) keeps track of files and their pieces
  - A single master node
- Stream-based I/O
  - Sequential access

HDFS: File Read
HDFS: File Write
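
The read and write flows are shown as diagrams on these slides. As a rough illustration of the same flows (not from the original deck), here is a minimal sketch using the HDFS Java API; the path /data/example.txt is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write: the client asks the Name Node where to put each block,
        // then streams to a pipeline of Data Nodes (3 replicas by default).
        Path path = new Path("/data/example.txt");  // hypothetical path
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("hello HDFS");
        out.close();

        // Read: the Name Node returns block locations; the client streams
        // sequentially from the nearest Data Node holding each block.
        FSDataInputStream in = fs.open(path);
        IOUtils.copyBytes(in, System.out, 4096, true);  // true = close when done
      }
    }
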
HDFS: Data Node Distance
Map Reduce

- A programming model
- Decomposes a processing job into Map and Reduce stages
- Developers provide code for the Map and Reduce functions, configure the job, and let Hadoop handle the rest

Map-Reduce Model
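
The model slide itself is a diagram. Schematically, following the standard MapReduce formulation, the two stages have these shapes:

    map:    (k1, v1)       -> list(k2, v2)
    reduce: (k2, list(v2)) -> list(k3, v3)
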
MAP function

- Maps each data entry into a <key, value> pair
- Examples
  - Map each log file entry into <URL, 1>
  - Map a day's stock trading record into <STOCK, Price>
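
To make the first example concrete, here is a minimal Mapper sketch (not from the original deck) that emits <URL, 1> per log line, assuming the URL is the first whitespace-separated field:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits <URL, 1> for every log line.
    public class UrlCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text url = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
          url.set(fields[0]);       // assumed: URL is the first field
          context.write(url, ONE);  // emit <URL, 1>
        }
      }
    }
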
Hadoop: Shuffle/Merge phase

- Hadoop merges (shuffles) the output of the MAP stage into <key, value1, value2, value3, ...>
- Examples
  - <URL, 1, 1, 1, 1, 1, 1>
  - <STOCK, Price on day 1, Price on day 2, ...>

Reduce function

- Reduces the entries produced by Hadoop's merge processing into a <key, value> pair
- Examples
  - Reduce <URL, 1, 1, 1> into <URL, 3>
  - Reduce <STOCK, 3, 2, 10> into <STOCK, 10>
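
A matching Reducer sketch for the <URL, 1, 1, 1> example simply sums the values:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Turns <URL, 1, 1, 1> into <URL, 3>.
    public class UrlCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text url, Iterable<IntWritable> ones, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable one : ones) {
          sum += one.get();
        }
        context.write(url, new IntWritable(sum));
      }
    }
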
Map-Reduce Flow
Hadoop Infrastructure

- Replicates/distributes data among the nodes
  - Input
  - Output
  - Map/Shuffle output
- Schedules processing
  - Partitions the data
  - Assigns processing nodes (PN)
  - Moves code to the PN (e.g., sends the Map/Reduce code)
  - Manages failures (block CRC, rerunning Map/Reduce if necessary)

Example: Trading Data Processing

- Input:
  - Historical stock data
  - Records are CSV (comma-separated values) text files
  - Each line: stock_symbol, low_price, high_price
  - 1987-2009 data for all stocks, one record per stock per day
- Output:
  - Maximum interday delta for each stock

Map Function: Part I
Map Function: Part II
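
The Map function source shown on these two slides is not reproduced in this transcript. A minimal sketch of what it could look like for the CSV format above (the class name and field handling are assumptions):

    import java.io.IOException;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Parses "stock_symbol,low_price,high_price" and emits
    // <stock_symbol, high_price - low_price> for each trading day.
    public class MaxDeltaMapper
        extends Mapper<LongWritable, Text, Text, FloatWritable> {

      private final Text stock = new Text();
      private final FloatWritable delta = new FloatWritable();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length < 3) {
          return;                        // skip malformed lines
        }
        try {
          float low = Float.parseFloat(fields[1].trim());
          float high = Float.parseFloat(fields[2].trim());
          stock.set(fields[0].trim());
          delta.set(high - low);
          context.write(stock, delta);   // emit <STOCK, delta>
        } catch (NumberFormatException e) {
          // skip lines with unparsable prices (e.g. a header)
        }
      }
    }
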
Reduce Function
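
Likewise, a sketch of the Reduce function, keeping the maximum delta per stock:

    import java.io.IOException;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduces <STOCK, delta, delta, ...> to <STOCK, maximum delta>.
    public class MaxDeltaReducer
        extends Reducer<Text, FloatWritable, Text, FloatWritable> {

      @Override
      protected void reduce(Text stock, Iterable<FloatWritable> deltas, Context context)
          throws IOException, InterruptedException {
        float max = Float.NEGATIVE_INFINITY;
        for (FloatWritable d : deltas) {
          max = Math.max(max, d.get());
        }
        context.write(stock, new FloatWritable(max));
      }
    }
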
Running the Job: Part I
Running the Job: Part II
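
The job-configuration slides are not reproduced either; a sketch of a driver using the org.apache.hadoop.mapreduce API, with input/output paths taken from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxDeltaJob {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max interday delta");
        job.setJarByClass(MaxDeltaJob.class);
        job.setMapperClass(MaxDeltaMapper.class);
        // Taking a maximum is associative, so the reducer can double as a
        // combiner, shrinking the data shuffled between Map and Reduce.
        job.setCombinerClass(MaxDeltaReducer.class);
        job.setReducerClass(MaxDeltaReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // CSV input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
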
Inside Hadoop
Datastore: HBASE

- A distributed, column-oriented database on top of HDFS
- Modeled after Google's BigTable data store
- Random reads/writes on top of the sequential, stream-oriented HDFS
- Billions of rows * millions of columns * thousands of versions

HBASE: Logical View

Row Key         Time Stamp   Column "contents"   Column Family "anchor" (referred by/to)   Column "mime"
"com.cnn.www"   T9                               anchor:cnnsi.com = "cnn.com/1"
                T8                               anchor:my.look.ca = "cnn.com/2"
                T6           "<html>.."                                                     "text/html"
                T5           "<html>.."
                T3           "<html>.."

Physical View

Each column family is stored separately on disk; empty cells are simply not stored:

Row Key         Time Stamp   Column "contents"
"com.cnn.www"   T6           "<html>.."
                T5           "<html>.."
                T3           "<html>.."

Row Key         Time Stamp   Column Family "anchor"
"com.cnn.www"   T9           anchor:cnnsi.com = "cnn.com/1"
                T8           anchor:my.look.ca = "cnn.com/2"

Row Key         Time Stamp   Column "mime"
"com.cnn.www"   T6           "text/html"

HBASE: Region Servers

- Tables are split into horizontal regions
  - Each region comprises a subset of rows
- Each layer of the stack pairs a master with workers:
  - HDFS: NameNode, DataNode
  - MapReduce: JobTracker, TaskTracker
  - HBASE: Master Server, Region Server

HBASE Architecture
HBASE vs RDBMS

- HBase tables are similar to RDBMS tables, with a few differences:
  - Rows are sorted by Row Key
  - Only cells are versioned
  - Columns can be added on the fly by the client, as long as the column family they belong to preexists (see the client sketch below)
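
As an illustration of that last point (not from the original deck), a sketch using the classic 0.20-era HBase Java client to write and read one cell of the webtable example; the table name "webtable" and its pre-created "anchor" column family are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WebtableClient {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");  // assumed table name

        // Write one cell: the qualifier (cnnsi.com) is created on the fly;
        // only the column family "anchor" must exist in advance.
        Put put = new Put(Bytes.toBytes("com.cnn.www"));
        put.add(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"),
                Bytes.toBytes("cnn.com/1"));
        table.put(put);

        // Read the newest version of that cell back.
        Get get = new Get(Bytes.toBytes("com.cnn.www"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("anchor"),
                                       Bytes.toBytes("cnnsi.com"));
        System.out.println(Bytes.toString(value));

        table.close();
      }
    }

Newer HBase versions replace HTable with Connection/Table and Put.add with Put.addColumn, but the shape of the calls is the same.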