Personal_3.MapReduce An Introduction - hadoop
Understanding MapReduce
• MapReduce - An Introduction
• Word count – default
• Word count – custom
MapReduce is a programming model for processing large datasets.
Supported languages for MR: Java, Ruby, Python, C++
MapReduce programs are inherently parallel.
More data? Add more machines to analyze it.
No need to change anything in the code.
Start with the WORDCOUNT example
“Do as I say, not as I do”
Word    Count
As      2
Do      2
I       2
Not     1
Say     1
define wordCount as Map<String,long>;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);
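The pseudocode above can be sketched as a small runnable program. This is only an illustration: the `tokenize` rule (lowercase runs of letters and apostrophes) is an assumption, since the slides do not say how documents are split into tokens.

```python
import re
from collections import Counter

def tokenize(document):
    # Assumption: a token is a lowercase run of letters or apostrophes.
    return re.findall(r"[a-z']+", document.lower())

def word_count(document_set):
    # One in-memory map holds the count for every distinct word.
    counts = Counter()
    for document in document_set:
        for token in tokenize(document):
            counts[token] += 1
    return counts

counts = word_count(["Do as I say, not as I do"])
for word in sorted(counts):
    print(word, counts[word])
```

Everything lives in one process and one in-memory map, which is exactly the limitation the following slides discuss.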
This works as long as the number of documents to process is not very large.
Spam filter
Millions of emails
Word count for analysis
Working from a single computer is time-consuming
Rewrite the program to count across multiple machines
How do we attain parallel computing?
1. Each machine processes a fraction of the documents
2. Combine the results from all the machines
STAGE 1
define wordCount as Map<String,long>;
for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
STAGE 2
define totalWordCount as Multiset;
for each wordCount received from stage 1 {
    multisetAdd(totalWordCount, wordCount);
}
display(totalWordCount);
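A minimal sketch of the two stages, with the worker machines simulated by plain function calls in one process. The round-robin split into `subsets` and the whitespace tokenizer are assumptions made for the illustration; on a real cluster each `stage1` call would run on a different machine.

```python
from collections import Counter

def stage1(document_subset):
    # Runs on each worker: count the words in its share of the documents.
    counts = Counter()
    for document in document_subset:
        counts.update(document.lower().split())
    return counts

def stage2(partial_counts):
    # Runs on a single machine: add up the per-worker counts.
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

documents = ["do as i say", "not as i do", "say it again", "do it"]
num_workers = 2
# Assumption: documents are dealt out round-robin to the workers.
subsets = [documents[i::num_workers] for i in range(num_workers)]
partials = [stage1(subset) for subset in subsets]
total_word_count = stage2(partials)
print(total_word_count)
```

The merged result is the same as counting all the documents on one machine, which is what makes the split safe.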
[Figure: the Master distributes Documents to Comp-1, Comp-2, Comp-3 and Comp-4]
Problems: STAGE 1
• How the documents are split among the machines has to be well defined
[Figure: the Master distributes Documents to Comp-1, Comp-2, Comp-3 and Comp-4]
• Bottleneck in network transfer
• The processing is data-intensive
• It is not compute-intensive
• So it is better to store the files on the processing machines
• BIGGEST FLAW
• The words and counts are stored in memory
• A disk-based hash table implementation is needed
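One way to picture a disk-based table of counts is sketched below. The slides do not name an implementation, so using SQLite here is purely an assumption for illustration, as are the table name `word_count` and the helper names.

```python
import os
import sqlite3
import tempfile

def count_words_on_disk(documents, db_path):
    # Counts live in a file on disk, so they are not limited by RAM.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_count "
        "(word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for document in documents:
        for token in document.lower().split():
            # Create the row on first sight, then increment it.
            conn.execute(
                "INSERT OR IGNORE INTO word_count (word, n) VALUES (?, 0)",
                (token,),
            )
            conn.execute(
                "UPDATE word_count SET n = n + 1 WHERE word = ?", (token,)
            )
    conn.commit()
    conn.close()

def read_counts(db_path):
    conn = sqlite3.connect(db_path)
    rows = dict(conn.execute("SELECT word, n FROM word_count"))
    conn.close()
    return rows

db_path = os.path.join(tempfile.mkdtemp(), "wordcount.db")
count_words_on_disk(["do as i say", "not as i do"], db_path)
disk_counts = read_counts(db_path)
print(disk_counts)
```

The trade-off is speed: every increment touches the database instead of a dictionary in memory.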
Problems: STAGE 2
• Phase 2 has only one machine (the Master)
• Bottleneck
• Phase 1 is highly distributed, though
• Make phase 2 distributed as well
• This needs changes in phase 1
• Partition the phase-1 output (say, based on the first character of the word)
• We have 26 machines in phase 2
• The single disk-based hash table now becomes 26 disk-based hash tables
• wordcount-a, wordcount-b, wordcount-c, ...
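[Figure: the Master splits Documents across the phase-1 machines Comp-1 to Comp-4. Each produces per-letter counts (for example A 1, B 2, C 4, D 5, E 10), which are sent to the phase-2 machines Comp-10, Comp-20, Comp-30, Comp-40, ... that hold the aggregated per-letter counts.]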
After phase 1
From Comp-1:
▪ WordCount-A goes to Comp-10
▪ WordCount-B goes to Comp-20
▪ ...
Each machine in phase 1 shuffles its output to different machines in phase 2.
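The partition and shuffle steps described above can be sketched as follows. The function names and the sample counts for two phase-1 machines are hypothetical; only the rule "partition by first character of the word" comes from the slides.

```python
from collections import Counter, defaultdict

def partition(word):
    # Rule from the slides: route a word by its first character.
    return word[0].lower()

def partition_counts(word_counts):
    # Phase-1 side: split one machine's counts into per-partition buckets.
    buckets = defaultdict(Counter)
    for word, n in word_counts.items():
        buckets[partition(word)][word] += n
    return buckets

def shuffle(all_buckets):
    # Deliver every bucket to the phase-2 machine that owns its partition.
    inbox = defaultdict(list)
    for buckets in all_buckets:
        for key, counts in buckets.items():
            inbox[key].append(counts)
    return inbox

def reduce_partition(received):
    # Phase-2 side: add up the buckets received for one partition.
    total = Counter()
    for counts in received:
        total.update(counts)
    return total

# Hypothetical phase-1 outputs from two machines.
comp1 = Counter({"apple": 1, "as": 2, "do": 1})
comp2 = Counter({"as": 1, "dog": 4, "do": 2})
inbox = shuffle([partition_counts(comp1), partition_counts(comp2)])
totals = {key: reduce_partition(received) for key, received in inbox.items()}
print(totals)
```

Because a given word always lands in the same partition, each phase-2 machine can finish its totals without talking to the others.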
This is getting complicated
Store the files where they are being processed
Write a disk-based hash table to get around RAM limitations
Partition the phase-1 output
Shuffle the phase-1 output and send it to the appropriate reducer
This is a lot of work for a word count
We haven’t even touched fault tolerance
What if Comp-1 or Comp-10 fails?
So we need a framework to take care of all these things
We concentrate only on the business logic
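[Figure: Documents stored in HDFS are read by the MAPPER machines Comp-1 to Comp-4. Their interim output (per-letter counts, for example A 1, B 2, C 4, D 5, E 10) goes through Partitioning and Shuffling to the REDUCER machines Comp-10, Comp-20, Comp-30 and Comp-40, coordinated by the Master.]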
Mapper and Reducer
The Mapper filters and transforms the input.
The Reducer collects the mapper output and aggregates it.
Extensive research went into arriving at this two-phase strategy.
Mapper, Reducer, Partitioner and Shuffling work together as a common structure for data processing.
           Input            Output
Mapper     <K1,V1>          List<K2,V2>
Reducer    <K2,list(V2)>    List<K3,V3>
Mapper
<key,words_per_line> : Input
<word,1> : Output
Reducer
<word,list(1)> : Input
<word,count(list(1))> : Output
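The input and output shapes above can be sketched as two small functions plus a driver that stands in for the framework. The `run_job` driver is hypothetical, and the line index used as the mapper key is only a placeholder, since the word count mapper ignores its key.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, line):
    # <key, words_per_line> -> list of <word, 1>
    return [(word, 1) for word in line.lower().split()]

def reducer(word, ones):
    # <word, list(1)> -> <word, count(list(1))>
    return (word, sum(ones))

def run_job(lines):
    # Stand-in for the framework: map, group by key, then reduce.
    intermediate = []
    for index, line in enumerate(lines):
        intermediate.extend(mapper(index, line))
    # Sorting brings equal keys together so they can be grouped.
    intermediate.sort(key=itemgetter(0))
    results = []
    for word, pairs in groupby(intermediate, key=itemgetter(0)):
        results.append(reducer(word, [value for _, value in pairs]))
    return results

result = run_job(["do as i say", "not as i do"])
print(result)
```

Only `mapper` and `reducer` carry the word count logic; the grouping in between is what the framework provides.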
As said, don’t keep the data in memory.
So keys and values regularly have to be written to disk.
They must be serialized.
Hadoop provides its own serialization mechanism.
Any class used as a key or value has to implement the Writable interface (keys additionally implement WritableComparable).
Java Type    Hadoop Serialized Type
String       Text
Integer      IntWritable
Long         LongWritable
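To make the idea of serializing a key and a value concrete, here is a small sketch that writes a <word, count> pair to bytes and reads it back. The byte layout (a 4-byte big-endian length followed by UTF-8 text, and a 4-byte big-endian integer) is an assumption chosen for illustration and is not claimed to match Hadoop's own encoding; the helper names are hypothetical.

```python
import struct

def write_int(value):
    # 4-byte big-endian signed integer.
    return struct.pack(">i", value)

def read_int(data, pos=0):
    (value,) = struct.unpack_from(">i", data, pos)
    return value, pos + 4

def write_text(value):
    # Length prefix followed by the UTF-8 bytes of the string.
    encoded = value.encode("utf-8")
    return struct.pack(">i", len(encoded)) + encoded

def read_text(data, pos=0):
    length, pos = read_int(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length

# Serialize the pair <"say", 1>, then read it back field by field.
record = write_text("say") + write_int(1)
word, pos = read_text(record)
count, pos = read_int(record, pos)
print(word, count, len(record))
```

The reader has to consume the fields in the same order the writer produced them, which is the same contract a write/read pair of methods imposes.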
Let’s try to execute the following commands
▪ hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount
▪ hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output>
What does this code do?
Switch to Eclipse