Transcript example

MapReduce
First-year master's student, Computer Science and Information Engineering: 黃威凱
Outline
Purpose
Example
Method
Advanced

PURPOSE
Purpose
Data mining
Data processing

EXAMPLE
Example
Find the maximum temperature for each year
Data from the National Climatic Data Center (NCDC)

◦ The data is stored in a line-oriented ASCII format, in which each line is a record
◦ There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year

Example (Data format)

Example (gzipped files, listing for 1990)

% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
METHOD
Method
Analyzing the data with Unix tools
Analyzing the data with Hadoop

Method (Unix tools)

Here is the beginning of a run:

% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
The complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large instance.
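
For comparison, the same sequential scan can be sketched in Python. This is not the max_temperature.sh script from the slide, which is not in the text. The raw/<year>/*.gz layout comes from the listing above; the fixed-width offsets, the 9999 missing-value marker, and the quality codes are assumptions, since the data-format slide is also missing, and the function names are hypothetical.

```python
import gzip
from pathlib import Path

# Assumed record layout (the data-format slide is not in the text).
TEMP_SLICE = slice(87, 92)   # signed air temperature, tenths of a degree Celsius
QUALITY_POS = 92             # one-character quality code
MISSING = 9999               # assumed marker for a missing reading
GOOD_QUALITY = "01459"       # assumed set of acceptable quality codes


def max_temperature_for_year(year_dir):
    """Return the highest valid reading in one year directory, or None."""
    best = None
    for gz_path in sorted(Path(year_dir).glob("*.gz")):
        with gzip.open(gz_path, "rt", encoding="ascii") as records:
            for line in records:
                if len(line) <= QUALITY_POS:
                    continue  # too short to hold a temperature field
                temp = int(line[TEMP_SLICE])
                if temp != MISSING and line[QUALITY_POS] in GOOD_QUALITY:
                    best = temp if best is None else max(best, temp)
    return best


def max_temperature_by_year(root):
    """Scan every year directory under root, one after another."""
    return {
        year_dir.name: max_temperature_for_year(year_dir)
        for year_dir in sorted(Path(root).iterdir())
        if year_dir.is_dir()
    }
```

Every file is read one after another on one machine, which is the limitation the Hadoop approach addresses.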

Method (Hadoop)

Use MapReduce
◦ Map
◦ Shuffle
◦ Reduce

Method (Hadoop)

Map function
◦ Pull out the year and the air temperature from each record
◦ Emit them as (year, temperature) key-value pairs
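
A minimal Python sketch of such a map function, not Hadoop's Java API. The field offsets, the 9999 missing-value marker, and the quality codes are assumptions (the data-format slide is not in the text), and map_record is a hypothetical name.

```python
# Assumed fixed-width layout of one record line.
YEAR_SLICE = slice(15, 19)   # four-digit year
TEMP_SLICE = slice(87, 92)   # signed air temperature, tenths of a degree Celsius
QUALITY_POS = 92             # one-character quality code
MISSING = 9999               # assumed marker for a missing reading
GOOD_QUALITY = "01459"       # assumed set of acceptable quality codes


def map_record(line):
    """Yield at most one (year, temperature) pair for one record line."""
    if len(line) <= QUALITY_POS:
        return  # record too short to contain the fields
    year = line[YEAR_SLICE]
    temp = int(line[TEMP_SLICE])
    if temp != MISSING and line[QUALITY_POS] in GOOD_QUALITY:
        yield year, temp
```

Records with a missing or suspect reading yield nothing, so the reduce step never sees them.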

Method (Hadoop)

Map function
◦ The shuffle: map output is sorted and grouped by key before it reaches the reducers
◦ Each reduce task is fed by many map tasks.
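
The shuffle can be pictured with a small in-memory Python sketch. The function names are hypothetical, and the partition rule (year modulo the number of reducers) is an assumption chosen to keep the example deterministic; Hadoop's default partitioner hashes the key instead.

```python
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key (assumed rule: numeric year modulo reducers)."""
    return int(key) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route pairs from every map task to a reducer and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:            # one list of pairs per map task
        for key, value in pairs:
            reducers[partition(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order.
    return [sorted(groups.items()) for groups in reducers]


map_outputs = [
    [(1950, 0), (1949, 111), (1950, 22)],   # output of map task 1
    [(1949, 78), (1950, -11)],              # output of map task 2
]
for reducer_input in shuffle(map_outputs, 2):
    print(reducer_input)
```

Both reducers receive values that originated in both map tasks, which is what "fed by many map tasks" means here.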

Method (Hadoop)

Reduce function
◦ Iterate through the list and pick the maximum reading
◦ Input:
  (1949, [111, 78])
  (1950, [0, 22, -11])
◦ Output:
  (1949, 111)
  (1950, 22)

Method (Hadoop)

Data flow

Method (Hadoop)

Java MapReduce: Mapper example

Method (Hadoop)

Java MapReduce: Reducer example

Method (Hadoop)

Java MapReduce: Job example
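
The Java listings for these slides are not in the text. As a rough stand-in, the following plain-Python sketch shows what a job wires together: a mapper, a grouping step, and a reducer. It does not use the Hadoop API. run_job, year_temp_mapper, and max_reducer are hypothetical names, and the "year temperature" input lines are a simplified format assumed for illustration.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Apply mapper to every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


def year_temp_mapper(record):
    """Parse a simplified 'year temperature' line into one pair."""
    year, temp = record.split()
    yield int(year), int(temp)


def max_reducer(year, temps):
    """Keep the highest temperature seen for the year."""
    return year, max(temps)


records = ["1950 0", "1950 22", "1950 -11", "1949 111", "1949 78"]
print(run_job(records, year_temp_mapper, max_reducer))
# [(1949, 111), (1950, 22)]
```

In a real Hadoop job the framework does the grouping across machines; the driver only names the input, the output, and the mapper and reducer classes.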
ADVANCED
Advanced

Case 1

Advanced

Case 2

Advanced

Case 3
Advanced

Combiner functions on map output
◦ Example
  Map 1 output: (1950, 0), (1950, 20), (1950, 10)
  Map 2 output: (1950, 25), (1950, 15)
  Grouped by key on each map:
  Map 1: (1950, [0, 20, 10])
  Map 2: (1950, [25, 15])
  Reduce input without a combiner:
  (1950, [0, 20, 10, 25, 15])
  Reduce input with a combiner (each map's local maximum):
  (1950, [20, 25])