4.HDFS & IO - hadoop
Download
Report
Transcript 4.HDFS & IO - hadoop
http://www.excelonlineclasses.co.nr/
[email protected]
http://www.excelonlineclasses.co.nr/
Excel Online Classes offers following
services:
Online Training
Development
Testing
Job support
Technical Guidance
Job Consultancy
Any needs of IT Sector
-
Nagarjuna K
•
Anatomy of MapReduce
•
•
•
•
•
•
•
MR work flow
Hadoop data types
Mapper
Reducer
Partitioner
Combiner
Input Split vs Block Size
Partitioning
Shuffling
.
I
N
P
U
T
D
A
T
A
NODE 1
Map
Interim
data
Reduce
Node to
store output
NODE 2
Map
Interim
data
Reduce
Node to
store output
NODE 2
Map
Interim
data
Reduce
Node to
store output
MR has a defined way of keys and values
types for it to move across cluster
Values Writable
Keys WritableComparable<T>
WritableComparable = Writable+Comparable<T>
Hadoop type
Wrapper for Java type
BooleanWritable
Boolean
ByteWritable
Byte
DoubleWritable
Double
IntWritable
Integer
LongWritable
Long
Text
String
NullWritable
Placeholder when key/value not needed
For any class to be value, it has to implement
org.apache.hadoop.io.Writable
write(DataOutput out)
readFields(DataInput in)
For any class to be key, it has to implement
org.apache.hadoop.io.WritableComparable<T>
+
compareTo(T o)
Check out few of the writables and writable
comparable
Time to write your own writables
Two libraries in Hadoop
org.apache.hadoop.mapred.*
org.apache.hadoop.mapreduce.*
Should implement
org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2>
▪ Void configure(JobConf job)
▪ All the parameters specified in the xmls are available here.
▪ Any parameter explicitly set are also available
▪ Call before data processing starts
▪ Void map (K1 key,V1 value, OutputCollector<K2,V2>
output,Reporter reporter)
▪ Data process starts
▪ Void Close()
▪ Should close any files, db connections etc.,
▪ Reporter provides extra information of mapper to TT
Mapper
Functionality
IdentityMapper
Implemetns Mapper<K,V,K,V>
• Whatever the input it gets it gives that to output
InverseMapper
Implemetns Mapper<K,V,V,K>
• Inverses the key,value from the input to output
TokenCountMapper
Implements Mapper<K,Text,Text,LongWritable>
• Generates (token,1) from the input value tokenized.
Should implement
org.apache.hadoop.mapred.Redcuer
Sorts the incoming data based on key and groups
together all the values for a key
Reduce function is called for every key in the sorted
order
▪ void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3>
output, Reporter reporter)
Reporter provides extra information of mapper to TT
Reducer
Functionality
IdentityReducer<K,V>
Implements Reducer<K,V,K,V> and maps
inputs directly to outputs
LongSumReducer<K>
Implements
Reducer<K,LongWritable,K,LongWritable>
and
determines the sum of all values
corresponding to the given key
implements Partitioner<K,V>
configure()
int getPartition ( … )
▪ 0< return<no.of.reducers
Generally, implement Partitioner so same
keys go to one reducer
Generally two kinds of files in Hadoop
Text (plain , XML, html …. )
Binary (Sequence)
▪ It is a hadoop specific compressed binary file format.
▪ Optimized to transfer output from one MR to MR
We can customize
HDFS block size
Input splits
BLOCK 1
Big File is divided
into multiple blocks
and stored in hdfs.
This is a physical
division of data
dfs.block.size
(64MB default size)
BLOCK 2
LARGE FILE
BLOCK 3
BLOCK 4
Input split
LOGICAL DIVISION
A chunk of data processed by a mapper
Further divided into records
Map process these records
▪ Record = key + value
How to correlate to a DB table
▪ Group of rows split
▪ Row record
Key
Value
R1
R1
R2
R2
R3
R3
public interface InputSplit extends Writable {
long getLength() throws IOException;
String[] getLocations() throws IOException;
}
It doesn’t contain the data
Only locations where the data is present
Helps jobtracker to arrange tasktrackers (data locality).
getLength greater length split will be executed
How we get the data to mapper
Inputsplits and how the splits are divided into
records will be taken care by inputformat.
public interface InputFormat<K, V> {
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
Reporter reporter) throws IOException;
}
Mapper
getRecordReader() is called to get RecordReader
Once the record reader is obtained,
▪ Map method is called recursively until the end of the
split
K key = reader.createKey();
V value = reader.createValue();
while (reader.next(key, value)) {
mapper.map(key, value, output, reporter);
}
JobClient running the job
Gets inputsplits by calling getSplits() in
InputFormat
Determines data locations for the splits
Sends these locations to the JobTracker
JobTracker assigns mappers appropriately.
▪ Data locality
Base class for all implementations of
InputFormat , which uses files as input
Defines
Which files to include for the job
Implementation for generating splits
Set of Files converts to no.of splits
Splits only large files…. HOW LARGE ?
Larger than BlockSize
Can we control it ?
Property
Description
Default value
mapred.min.split.size
The smallest valid size in bytes for a
file split.
1
mapred.max.split.size
The largest valid size in bytes for a
file split.
Long.Max_value
dfs.block.size
The size of a block in HDFS in bytes.
64Mb
mapred.min.s
plit.size
mapred.max.s dfs.block.size
plit.size
split size
1
Long.MAX
64 MB
64MB
1
Long.MAX
128MB
128 MB
128 MB
Long.MAX
64MB
128MB
1
32MB
64MB
32MB
• Application may impose minimum split size greater than Block Size.
• There is no good reason to that
• Data locality is lost
Min split size
We might set it to larger than block size
But concept of data locality may be lost to some
extent
Split size calculated by formula
max(minimumSize, min(maximumSize, blockSize))
By default
▪ minimumSize < blockSize < maximumSize
Configure(JobConf job)
Property name
description
mapred.input.file
The path of the input file being
processed
mapred.input.start
The byte offset of the start of the split
map.input.length
The length of the split in bytes
Default FileInputFormat
Each line is a value
Byte offset is a key
Example
Run identity mapper program
Logical Records defined by FileInputFormat
doesn’t usually fit it into HDFS blocks.
EveryFile is written is written as sequence of bytes.
64 MB reached ? then start the new block
When 64 MB reached, the logical record may be half
written
So, the other half of logical record goes into the next
HDFS block.
So even in data locality some remote reading
is done.. a slight overhead.
Split gives logical record boundaries
Blocks – physical boundaries (size)
Files which are very small are inefficient in
mapper phase
Imagine 1GB
64Mb – 16 files – 16 mappers
100kb – 1000 files – 1000 mappers
Packs many files into single split
Data locality is taken into consideration
MR accelerates best if operated at disk
transfer rate not at seek rate
This helps in processing large files also
Same as TextInputFormat
Each split guarenteed to have N lines
mapred.line.input.format.linespermap
Each line in text file is a record
First separator character divides key and
value
Default is ‘\t’
Controller property
key.value.separator.in.input.line
InputFormat for reading sequence files
User defined Key K
User defined Value V
They are splittable files.
WellSuited for MR
They store compression
They can store arbitrary types
key,values stored as \t separated by default.
mapred.textoutputformat.separator -- parameter
CounterPart for KeyValueTextInputFormat
Can suppress key/value by using NullWritable