Transcript example
MapReduce
First-year master's student, Computer Science: 黃威凱
Outline
Purpose
Example
Method
Advanced
PURPOSE
Purpose
Data mining
Data processing
EXAMPLE
Example
Find the maximum temperature for each year
National Climatic Data Center (NCDC)
◦ The data is stored in a line-oriented ASCII
format, in which each line is a record
◦ There is a directory for each year from 1901
to 2001, each containing a gzipped file for
each weather station with its readings for that
year
Example(Data format)
Example
(Gzipped file, example for 1990)
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
METHOD
Method
Analyzing the data with Unix tools
Analyzing the data with Hadoop
Method(Unix tools)
Here is the beginning of a run:
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
The complete run for the century took 42
minutes in one run on a single EC2 High-CPU
Extra Large Instance.
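The serial approach above can be sketched in plain Python as a stand-in for max_temperature.sh, whose source is not shown on the slides. The field offsets (year at columns 15-19, signed temperature at 87-92, quality code at 92), the 9999 missing-value sentinel, and the accepted quality codes are assumptions about the NCDC record layout. The helpers make_record and build_demo_tree are hypothetical and write synthetic data, not real readings.

```python
import gzip
import os
import tempfile

MISSING = 9999  # assumed sentinel for a missing temperature reading


def parse_record(line):
    """Return (year, temperature) or None. The offsets are assumptions."""
    if len(line) < 93:
        return None
    year = int(line[15:19])
    temp = int(line[87:92])  # signed value, e.g. "+0317" or "-0011"
    quality = line[92]
    if temp == MISSING or quality not in "01459":
        return None
    return year, temp


def max_temperature_by_year(root):
    """Scan root/<year>/*.gz one file at a time and keep the maximum per year."""
    result = {}
    for year_dir in sorted(os.listdir(root)):
        year_path = os.path.join(root, year_dir)
        if not os.path.isdir(year_path):
            continue
        for name in sorted(os.listdir(year_path)):
            if not name.endswith(".gz"):
                continue
            path = os.path.join(year_path, name)
            with gzip.open(path, "rt", encoding="ascii") as f:
                for line in f:
                    parsed = parse_record(line.rstrip("\n"))
                    if parsed is None:
                        continue
                    year, temp = parsed
                    if year not in result or temp > result[year]:
                        result[year] = temp
    return result


def make_record(year, temp, quality="1"):
    """Hypothetical helper: build a synthetic 93-character record."""
    return "0" * 15 + year + "0" * 68 + f"{temp:+05d}" + quality


def build_demo_tree(root):
    """Hypothetical helper: write synthetic files in the <year>/<station>.gz layout."""
    demo = {
        "1901": {"000001-99999-1901.gz": [300, 317],
                 "000002-99999-1901.gz": [-50, 9999]},
        "1902": {"000001-99999-1902.gz": [244, 100]},
    }
    for year, files in demo.items():
        os.makedirs(os.path.join(root, year))
        for name, temps in files.items():
            path = os.path.join(root, year, name)
            with gzip.open(path, "wt", encoding="ascii") as f:
                for t in temps:
                    f.write(make_record(year, t) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_demo_tree(root)
        for year, temp in sorted(max_temperature_by_year(root).items()):
            print(year, temp)
```

Every file is read in sequence by one process, which is why the running time grows with the total size of the data set.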
Method(Hadoop)
Use MapReduce
◦ Map
◦ Shuffle
◦ Reduce
Method(Hadoop)
Map function
◦ Pull out the year and the air temperature
◦ Emit them as (year, temperature) key-value pairs
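A minimal sketch of such a map function in plain Python (not the Hadoop Java API). The field offsets, the 9999 missing-value sentinel, and the accepted quality codes are assumptions about the NCDC record layout. make_line is a hypothetical helper that builds synthetic records for the demo.

```python
MISSING = 9999  # assumed sentinel for a missing temperature reading


def map_record(line):
    """Yield at most one (year, temperature) pair for one input line."""
    if len(line) < 93:
        return  # too short to hold the assumed fields
    year = int(line[15:19])
    temp = int(line[87:92])  # signed value, e.g. "+0022" or "-0011"
    quality = line[92]
    if temp != MISSING and quality in "01459":
        yield year, temp


def make_line(year, temp, quality="1"):
    """Hypothetical helper: build a synthetic 93-character record."""
    return "0" * 15 + str(year) + "0" * 68 + f"{temp:+05d}" + quality


if __name__ == "__main__":
    lines = [make_line(1950, 0), make_line(1950, 22), make_line(1950, 9999)]
    for line in lines:
        for pair in map_record(line):
            print(pair)
```

Writing the function as a generator lets it emit nothing for a bad or missing reading, so filtering happens in the map phase and never reaches the reducer.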
Method(Hadoop)
Map function
◦ The shuffle
Each reduce task is fed by many map tasks.
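The shuffle can be pictured as merging the output of several map tasks and grouping the values by key. The following is a single-process sketch in plain Python, not Hadoop's actual implementation; the function name shuffle and the sample pairs are illustrative assumptions.

```python
from collections import defaultdict


def shuffle(map_outputs):
    """Merge the (key, value) pairs of several map tasks and group them by key.

    Returns a list of (key, [values]) sorted by key.
    """
    grouped = defaultdict(list)
    for task_output in map_outputs:
        for key, value in task_output:
            grouped[key].append(value)
    return sorted(grouped.items())


if __name__ == "__main__":
    map_task_1 = [(1950, 0), (1949, 111), (1950, 22)]
    map_task_2 = [(1950, -11), (1949, 78)]
    for key, values in shuffle([map_task_1, map_task_2]):
        print(key, values)
```

In this sketch the values keep the order in which the map outputs were merged; a real cluster gives no such ordering guarantee within a key.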
Method(Hadoop)
Reduce function
◦ Iterate through the list and pick the
maximum reading
◦ Input:
(1949, [111, 78])
(1950, [0, 22, -11])
◦ Output:
(1949, 111)
(1950, 22)
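The reduce step above, sketched in plain Python with the same input as on the slide. The function name reduce_max is an illustrative assumption, not part of Hadoop.

```python
def reduce_max(year, readings):
    """Return (year, maximum reading) for one key and its list of values."""
    return year, max(readings)


if __name__ == "__main__":
    reduce_input = [(1949, [111, 78]), (1950, [0, 22, -11])]
    for year, readings in reduce_input:
        print(reduce_max(year, readings))
```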
Method(Hadoop)
Data flow
Method(Hadoop)
Java MapReduce-Mapper example
Method(Hadoop)
Java MapReduce-Reduce example
Method(Hadoop)
Java MapReduce-Job example
ADVANCED
Advanced
Case1
Advanced
Case2
Advanced
Case3
Advanced
Combiner Functions on Map output
◦ Example
Map 1 output: (1950, 0), (1950, 20), (1950, 10)
Map 2 output: (1950, 25), (1950, 15)
Grouped by key, per map task:
Map 1: (1950, [0, 20, 10])
Map 2: (1950, [25, 15])
Reduce input without a combiner:
(1950, [0, 20, 10, 25, 15])
Reduce input with a combiner:
(1950, [20, 25])
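The effect of the combiner in this example can be reproduced with a short plain-Python sketch. The function name combine_max is an illustrative assumption; in Hadoop the combiner is configured on the job rather than called by hand.

```python
def combine_max(pairs):
    """Collapse one map task's output to a single (key, max value) pair per key."""
    best = {}
    for key, value in pairs:
        if key not in best or value > best[key]:
            best[key] = value
    return sorted(best.items())


if __name__ == "__main__":
    map_1 = [(1950, 0), (1950, 20), (1950, 10)]
    map_2 = [(1950, 25), (1950, 15)]

    # Without a combiner the reducer receives every reading.
    without = [value for _, value in map_1 + map_2]

    # With a combiner each map task sends only its local maximum.
    combined = combine_max(map_1) + combine_max(map_2)
    with_combiner = [value for _, value in combined]

    print(without, max(without))
    print(with_combiner, max(with_combiner))
```

Both paths give the same maximum because taking a maximum of partial maxima equals the maximum of all readings. That does not hold for every function: the mean of the five readings is 14, while the mean of the two per-map means (10 and 20) is 15, so a mean cannot be combined this way.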