Map-Reduce Tutorial
Cloud Computing: Project Tutorial
Hadoop Map-Reduce Programming
Wei Zhu
([email protected])
Department of Computer Science
University of Texas at Dallas
Agenda
Map Reduce Environment Configuration
Map Reduce Structure
Mapper Configuration
Combiner Configuration
Partitioner Configuration
Reducer Configuration
Useful Logs
7/12/2016
Cloud Computing: Project Tutorial Hadoop MapReduce Programming
2 of 13
Map Reduce Environment Configuration
Configuration for Ubuntu (/etc/hadoop/conf)
hadoop-env.sh
log4j.properties: where the logs are written
hdfs-site.xml: Hadoop cluster configuration
mapred-site.xml: information about the job
……
Map Reduce Structure
Programmers must specify:
map (k, v) → <k’, v’>*
reduce (k’, [v’]) → <k’’, v’’>*
All values with the same key are reduced together
Optionally, also:
partition (k’, number of partitions) → partition for k’
Often a simple hash of the key, e.g., hash(k’) mod n
Divides up key space for parallel reduce operations
combine (k’, v’) → <k’, v’>*
Mini-reducers that run in memory after the map phase
Used as an optimization to reduce network traffic
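The map, group-by-key, and reduce steps above can be sketched in plain Java. This is a minimal in-memory illustration using word count as the example, not Hadoop code: the class name MapReduceSketch and its methods are hypothetical, and the "shuffle" is just a sorted map that collects all values with the same key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; it uses no Hadoop classes.
class MapReduceSketch {
    // map (k, v) -> <k', v'>*: emit (word, 1) for every word in one input line.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce: all values that share one key arrive together and are summed.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // "Shuffle": group every emitted value under its key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        // Reduce each key group independently.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

Because each key group is reduced independently, the groups could be handed to different reducers, which is what the optional partition step decides.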
Map Reduce Structure
Class Mapper :
setup (Mapper.Context context)
Called once at the beginning of the task
map (k, v) → <k’, v’>*
cleanup (Mapper.Context context)
Called once at the end of the task.
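The call order described above (setup once, map once per record, cleanup once) can be illustrated with a small stand-in. SketchMapper and RecordCounter are hypothetical names invented for this sketch; they only mirror the shape of Hadoop's Mapper and use a plain list in place of the real Context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MapperLifecycleSketch {
    /** Hypothetical stand-in for Hadoop's Mapper; not the real class. */
    abstract static class SketchMapper {
        protected void setup(List<String> out) {}
        protected abstract void map(long key, String value, List<String> out);
        protected void cleanup(List<String> out) {}

        // Drives one task: setup once, map for every record, cleanup once.
        final void run(List<String> records, List<String> out) {
            setup(out);
            long key = 0;
            for (String record : records) {
                map(key++, record, out);
            }
            cleanup(out);
        }
    }

    /** Keeps a counter in memory and emits a single total from cleanup(). */
    static class RecordCounter extends SketchMapper {
        private int seen;

        @Override protected void setup(List<String> out) {
            seen = 0;          // per-task initialisation belongs here
            out.add("setup");
        }

        @Override protected void map(long key, String value, List<String> out) {
            seen++;            // called once per input record
        }

        @Override protected void cleanup(List<String> out) {
            out.add("records=" + seen); // emit state accumulated over the task
        }
    }

    static List<String> runTask(List<String> records) {
        List<String> out = new ArrayList<>();
        new RecordCounter().run(records, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runTask(Arrays.asList("x", "y", "z"))); // [setup, records=3]
    }
}
```

Holding state in a field and emitting it from cleanup is one common reason to override these two methods; opening and closing a side resource is another.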
Mapper Configuration
How many maps?
Number of Maps
The number of maps is usually driven by the total size of the
inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per node.
JobConf.setNumMapTasks(int)
Only provides a hint to the framework; it can be used to set the
number of maps even higher.
Exists only in the old API (JobConf).
Mapper Configuration
How many maps?
FileInputFormat
mapred.max.split.size, set by setMaxInputSplitSize(Job, long)
mapred.min.split.size, set by setMinInputSplitSize(Job, long)
HDFS block size: for small inputs, set a smaller value with
dfs.block.size in hdfs-site.xml
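How the three settings above interact can be sketched with a little arithmetic. The split-size rule below, max(minSize, min(maxSize, blockSize)), is the one FileInputFormat applies in the new API; the split count is only a rough per-file estimate that ignores the slack Hadoop allows on the last split. The class name and the example sizes are assumptions chosen for illustration.

```java
// Hypothetical helper; plain arithmetic, no Hadoop classes involved.
class SplitSizeSketch {
    // The block size is clamped between the configured minimum and maximum.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough estimate of the number of splits (and so map tasks) for one file.
    static long estimateSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 128 * mb;   // assumed HDFS block size
        long file = 1024 * mb;   // assumed 1 GB input file

        // No limits configured: one split per block.
        long dflt = computeSplitSize(block, 1L, Long.MAX_VALUE);
        System.out.println(estimateSplits(file, dflt));     // 8

        // Lower the maximum split size to get more maps.
        long smaller = computeSplitSize(block, 1L, 32 * mb);
        System.out.println(estimateSplits(file, smaller));  // 32

        // Raise the minimum split size to get fewer maps.
        long larger = computeSplitSize(block, 256 * mb, Long.MAX_VALUE);
        System.out.println(estimateSplits(file, larger));   // 4
    }
}
```

In short, lowering the maximum split size increases the number of maps, while raising the minimum above the block size decreases it.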
Combiner Configuration
Class Combiner
A "semi-reducer" in MapReduce
Has the same interface as Reducer
reduce()
Processes the output of map tasks before it is sent to the
reducers
Works on the output of a single mapper
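The effect of a combiner on one mapper's output can be sketched as follows. This is a plain Java illustration for word count, where every emitted word carries an implicit count of 1; the class and method names are hypothetical. In a real job the combiner is registered with Job.setCombinerClass, and because the framework may run it zero or more times, the operation should be associative and commutative, as summation is here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; it only mimics what a combiner does to one mapper's output.
class CombinerSketch {
    // Locally sums the (word, 1) pairs emitted by a single map task.
    static Map<String, Integer> combine(List<String> emittedWords) {
        Map<String, Integer> partialSums = new TreeMap<>();
        for (String word : emittedWords) {
            partialSums.merge(word, 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("a", "b", "a", "a");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("records before combine: " + mapOutput.size()); // 4
        System.out.println("records after combine: " + combined.size());   // 2
        System.out.println(combined);                                      // {a=3, b=1}
    }
}
```

Only the partial sums cross the network, and the reducer then adds the partial sums arriving from all mappers.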
Partitioner Configuration
Class Partitioner
Partitioning
Determines which reducer instance receives which
intermediate keys and values.
getPartition()
Receives a key, a value, and the number of partitions to split
the data across, and returns the partition number for that key
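A hash-based getPartition can be sketched in a few lines. The formula below, the key's hash code with the sign bit masked off, modulo the number of partitions, is the one Hadoop's default HashPartitioner uses; the surrounding class is a hypothetical stand-alone version that does not extend the real Partitioner.

```java
// Hypothetical stand-alone sketch; it does not extend Hadoop's Partitioner.
class HashPartitionSketch {
    // The value is accepted to match the getPartition shape but is not used.
    static <K, V> int getPartition(K key, V value, int numPartitions) {
        // Masking with Integer.MAX_VALUE keeps the result non-negative
        // even when hashCode() is negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"a", "b", "hadoop", "a"}) {
            System.out.println(key + " -> reducer " + getPartition(key, 1, reducers));
        }
    }
}
```

The same key always maps to the same partition, which is what guarantees that all values for a key reach one reducer. A custom partitioner changes only how that number is computed.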
Reducer Configuration
Class Reducer:
Job.setReducerClass(YourReducer.class)
setup (Reducer.Context context)
Called once at the beginning of the task
reduce (k, [v]) → <k’, v’>*
cleanup (Reducer.Context context)
Called once at the end of the task
Number of Reducers
Job.setNumReduceTasks(int)
Useful Logs
Node Resource Manager Logs
/var/log/hadoop-yarn/yarn
yarn-site.xml
Application name, start date, user name, Hadoop
queue, job outcome (success or failure), duration,
maximum memory allocated, percent of cluster used by
the job, details of the job executed……
Useful Logs
Job History Logs
/mr-history/done
mapred-site.xml
These files contain a wealth of performance data on the
execution of Mappers and Reducers, including HDFS
statistics, data volume processed, memory allocated, etc.
Useful Logs
See the logs from the GUI.