Slide 1 - Kangwon
Distributed and Parallel Processing Technology
Chapter 2.
MapReduce
Sun Jo
1
Introduction
MapReduce is a programming model for data processing.
Hadoop can run MapReduce programs written in various languages.
We shall look at the same program expressed in Java, Ruby, Python, and
C++.
2
A Weather Dataset
Program that mines weather data
Weather sensors collect data every
hour at many locations across the
globe
They gather a large volume of log data,
which is a good candidate for analysis
with MapReduce
Data Format
Data from the National Climatic Data
Center (NCDC)
Stored using a line-oriented ASCII
format, in which each line is a record
3
A Weather Dataset
Data Format
Data files are organized by date and weather station.
There is a directory for each year from 1901 to 2001, each containing a gzipped file
for each weather station with its readings for that year.
The whole dataset is made up of a large number of relatively small files, since there
are tens of thousands of weather stations.
The data was preprocessed so that each year’s readings were concatenated into a
single file.
4
Analyzing the Data with Unix Tools
What’s the highest recorded global temperature for each year in the dataset?
A Unix shell script with awk, the classic tool for processing line-oriented data
The script loops through the compressed year files, first printing the year and then
processing each file using awk.
Awk extracts the air temperature and the quality code from the data.
A temperature value of 9999 signifies a missing value in the NCDC dataset.
Beginning of a run
Maximum temperature is 31.7℃ for 1901.
The complete run for the century took 42 minutes in one run on a single EC2 High-CPU
Extra Large Instance.
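The awk pass described above can be sketched in Python; this is a simulation of the logic only (the column offsets follow the NCDC fixed-width layout; the helper name is illustrative):

```python
import re

def max_temp_for_year(lines):
    """Scan NCDC fixed-width records and return the highest valid reading.

    Mirrors the awk logic: pull out the air temperature and quality code,
    skipping missing values (9999) and readings with bad quality codes.
    """
    max_temp = None
    for line in lines:
        temp = int(line[87:92])      # signed air temperature, tenths of a degree
        quality = line[92]           # single-character quality code
        if temp != 9999 and re.match("[01459]", quality):
            if max_temp is None or temp > max_temp:
                max_temp = temp
    return max_temp

# A toy record padded to the right width: year in columns 16-19,
# temperature in columns 88-92, quality code in column 93.
record = "0" * 15 + "1901" + "0" * 68 + "+0317" + "1" + "0" * 30
print(max_temp_for_year([record]))  # 317, i.e. 31.7 degrees C
```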
5
Analyzing the Data with Unix Tools
To speed up the processing, run parts of the program in parallel
Problems for parallel processing
Dividing the work into equal-size pieces isn’t always easy or obvious.
• The file sizes for different years vary
• The whole run is dominated by the longest file
• A better approach is to split the input into fixed-size chunks and assign each chunk to a process
Combining the results from independent processes may need further processing.
Still limited by the processing capacity of a single machine, handling coordination and
reliability for multiple machines
Although it’s feasible to parallelize the processing, it’s messy in practice.
6
Analyzing the Data with Hadoop – Map and Reduce
Map and Reduce
MapReduce works by breaking the processing into 2 phases: the map and the reduce.
Both map and reduce phases have key-value pairs as input and output.
Programmers have to specify two functions: map and reduce function.
The input to the map phase is the raw NCDC data.
• Here, the key is the offset of the beginning of the line and the value is each line of the data set.
The map function pulls out the year and the air temperature from each input value.
The reduce function takes <year, temperature> pairs as input and produces the
maximum temperature for each year as the result.
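In plain Python, the two functions might look like this (a sketch of the logic only, not Hadoop's actual API; the names are illustrative):

```python
def map_fn(offset, line):
    """Map: pull the year and air temperature out of one NCDC record.

    The input key (the line's offset) is ignored; valid readings are
    emitted as <year, temperature> pairs.
    """
    year = line[15:19]
    temp = int(line[87:92])
    quality = line[92]
    if temp != 9999 and quality in "01459":
        yield (year, temp)

def reduce_fn(year, temps):
    """Reduce: produce the maximum temperature for one year."""
    yield (year, max(temps))

# After the framework groups map output by year:
print(list(reduce_fn("1949", [111, 78])))  # [('1949', 111)]
```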
7
Analyzing the Data with Hadoop – Map and Reduce
Original NCDC Format
Input file for the map function, stored in HDFS
Output of the map function, running in parallel for each block
Input for the reduce function & Output of the reduce function
8
Analyzing the Data with Hadoop – Map and Reduce
The whole data flow
Map() output:
<1950, 0>, <1950, 22>, <1949, 111>, <1951, 10>, <1952, 22>,
<1954, 0>, <1954, 22>, <1950, -11>, <1949, 78>, <1951, 25>
Shuffling groups the pairs by year; Reduce() input:
<1949, [111, 78]>, <1950, [0, 22, -11]>, <1951, [10, 25]>, <1952, [22]>, <1954, [0, 22]>
Reduce() output:
<1949, 111>, <1950, 22>, <1951, 25>, <1952, 22>, <1954, 22>
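The shuffle step (grouping map output by key, with keys in sorted order) can be simulated with a dictionary; a sketch of the behavior, not Hadoop's implementation:

```python
from collections import defaultdict

def shuffle(map_output):
    """Group <year, temperature> pairs by year, with keys in sorted order."""
    groups = defaultdict(list)
    for year, temp in map_output:
        groups[year].append(temp)
    return sorted(groups.items())

pairs = [(1950, 0), (1950, 22), (1949, 111), (1951, 10),
         (1950, -11), (1949, 78), (1951, 25)]
print(shuffle(pairs))
# [(1949, [111, 78]), (1950, [0, 22, -11]), (1951, [10, 25])]
print([(year, max(temps)) for year, temps in shuffle(pairs)])
# [(1949, 111), (1950, 22), (1951, 25)]
```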
9
Analyzing the Data with Hadoop – Java MapReduce
Having seen how the MapReduce program works, the next step is to express it in code.
A map function, a reduce function, and some code to run the job are needed.
Map function
10
Analyzing the Data with Hadoop – Java MapReduce
Reduce function
11
Analyzing the Data with Hadoop – Java MapReduce
Main function for running the MapReduce job
12
Analyzing the Data with Hadoop – Java MapReduce
A test run
The output is written to the output directory, which contains one output file
per reducer
13
Analyzing the Data with Hadoop – Java MapReduce
The new Java MapReduce API
The new API, referred to as “Context Objects”, is type-incompatible with the old, so
applications need to be rewritten to take advantage of it.
Notable differences
• Favors abstract classes over interfaces: Mapper and Reducer are now abstract classes rather than interfaces.
• The new API is in the org.apache.hadoop.mapreduce package and subpackages.
• The old API can still be found in org.apache.hadoop.mapred
• Makes extensive use of context objects that allow the user code to communicate with the MapReduce system
• e.g., the MapContext unifies the roles of the JobConf, the OutputCollector, and the Reporter
• Supports both a ‘push’ and a ‘pull’ style of iteration
• Basically key-value record pairs are pushed to the mapper, but in addition, the new API allows a
mapper to pull records from within the map() method.
• The same goes for the reducer
• Configuration has been unified.
• The old API has a JobConf object for job configuration, which is an extension of Hadoop’s vanilla
Configuration object.
• In the new API, job configuration is done through a Configuration.
• Job control is performed through the Job class rather than JobClient.
• Output files are named slightly differently
• part-m-nnnnn for map outputs, part-r-nnnnn for reduce outputs
• (nnnnn is an integer designating the part number, starting from 0)
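The output-file naming scheme above can be reproduced with a format string (a small sketch; the helper name is my own):

```python
def output_file_name(task_type, part):
    """Build a task output file name: part-m-nnnnn or part-r-nnnnn,
    where the part number is zero-padded to five digits."""
    return "part-%s-%05d" % (task_type, part)

print(output_file_name("m", 0))   # part-m-00000
print(output_file_name("r", 12))  # part-r-00012
```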
14
Analyzing the Data with Hadoop – Java MapReduce
The new Java MapReduce API
Example 2-6 shows the MaxTemperature application rewritten to use the new API.
15
Scaling Out
To scale out, we need to store the data in a distributed filesystem, HDFS.
Hadoop moves the MapReduce computation to each machine hosting a part
of the data.
Data Flow
A MapReduce job consists of the input data, the MapReduce program, and
configuration information.
Hadoop runs the job by dividing it into 2 types of tasks, map and reduce tasks.
Two types of nodes: one jobtracker and several tasktrackers
• Jobtracker: coordinates the job by scheduling tasks to run on tasktrackers.
• Tasktrackers: run tasks and send progress reports to the jobtracker.
Hadoop divides the input into fixed-size pieces, called input splits, or just splits.
Hadoop creates one map task for each split, which runs the user-defined map function
for each record in the split.
The quality of the load balancing increases as the splits become more fine-grained.
• Default size: 1 HDFS block, 64 MB
Map tasks write their output to the local disk, not to HDFS.
If the node running a map task fails, Hadoop will automatically rerun the map task on
another node to re-create the map output.
16
Scaling Out
Data Flow – single reduce task
Reduce tasks don’t have the advantage of data locality – the input to a single reduce
task is normally the output from all mappers.
All map outputs are merged across the network and passed to the user-defined reduce
function.
The output of the reduce is normally stored in HDFS.
17
Scaling Out
Data Flow – multiple reduce tasks
The number of reduce tasks is specified independently; it is not governed by the input size.
The map tasks partition their output by keys, each creating one partition for each
reduce task.
There can be many keys and their associated values in each partition, but the records for
any key are all in a single partition.
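Hadoop's default partitioner buckets records by hashing the key modulo the number of reducers. A Python sketch of the scheme (Python's hash() differs from Java's hashCode(), so only the idea carries over):

```python
def partition(key, num_reduce_tasks):
    """Pick the partition for a key. Every record with the same key
    gets the same partition, hence goes to the same reduce task."""
    return hash(key) % num_reduce_tasks

records = [("1949", 111), ("1950", 22), ("1949", 78), ("1951", 25)]
partitions = {}
for key, value in records:
    partitions.setdefault(partition(key, 2), []).append((key, value))
# Both "1949" records are guaranteed to land in the same partition.
```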
18
Scaling Out
Data Flow – zero reduce task
19
Scaling Out
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster.
It pays to minimize the data transferred between map and reduce tasks.
Hadoop allows the user to specify a combiner function to be run on the map
output – the combiner function’s output forms the input to the reduce function.
The contract for the combiner function constrains the type of function that may be
used.
Example without a combiner function
• Map() output: <1950, 0>, <1950, 20>, <1950, 10>, <1950, 25>, <1950, 15>
• After shuffling, Reduce() input: <1950, [0, 20, 10, 25, 15]>
• Reduce() output: <1950, 25>
Example with a combiner function, finding the maximum temperature for each map task
• First map task output: <1950, 0>, <1950, 20>, <1950, 10> → combiner output: <1950, 20>
• Second map task output: <1950, 25>, <1950, 15> → combiner output: <1950, 25>
• After shuffling, Reduce() input: <1950, [20, 25]>
• Reduce() output: <1950, 25>
20
Scaling Out
Combiner Functions
The function calls on the temperature values can be expressed as follows:
• max(0, 20, 10, 25, 15) = max( max(0, 20, 10), max(25, 15) ) = max(20, 25) = 25
A job calculating mean temperatures could not use mean as its combiner function:
• mean(0, 20, 10, 25, 15) = 14
• mean( mean(0, 20, 10), mean(25, 15) ) = mean(10, 20) = 15
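The contrast is easy to check directly (a minimal sketch):

```python
temps = [0, 20, 10, 25, 15]
part1, part2 = temps[:3], temps[3:]

# max is associative and commutative, so it is safe as a combiner:
assert max(max(part1), max(part2)) == max(temps) == 25

def mean(values):
    return sum(values) / len(values)

# mean is not: combining partial means gives the wrong answer.
print(mean(temps))                       # 14.0
print(mean([mean(part1), mean(part2)]))  # 15.0
```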
The combiner function doesn’t replace the reduce function.
It can help cut down the amount of data shuffled between the maps and the reduces
21
Scaling Out
Combiner Functions
Specifying a combiner function
• The combiner function is defined using the Reducer interface
• It uses the same implementation as the reduce function in MaxTemperatureReducer.
• The only change is to set the combiner class on the JobConf.
22
Hadoop Streaming
Hadoop provides an API to MapReduce that lets you write your map and reduce
functions in languages other than Java, so a MapReduce program can be written in
almost any language.
Hadoop Streaming
Map input data is passed over standard input to your map function.
The map function processes the data line by line and writes lines to standard output.
A map output key-value pair is written as a single tab-delimited line.
The reduce function reads lines from standard input (sorted by key) and writes its
results to standard output.
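A Streaming mapper following this protocol can be sketched in Python (the structure and names are my own; the column offsets follow the NCDC format):

```python
import re
import sys

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Emit one tab-delimited <year, temperature> line per valid NCDC record.

    A real Streaming script would simply call run_mapper() at module level,
    reading sys.stdin and writing sys.stdout.
    """
    for line in stdin:
        year, temp, quality = line[15:19], line[87:92], line[92:93]
        if temp != "+9999" and re.match("[01459]", quality):
            stdout.write("%s\t%s\n" % (year, int(temp)))
```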
23
Hadoop Streaming
Ruby
The map function can be expressed in Ruby.
Simulating the map function in Ruby with a Unix pipeline
The reduce function for maximum temperature in Ruby
24
Hadoop Streaming
Ruby
Simulating the whole MapReduce pipeline with a Unix pipeline
Hadoop command to run the whole MapReduce job
Hadoop command when using a combiner written in a Streaming language
25
Hadoop Streaming
Python
Streaming supports any programming language that can read from standard input and
write to standard output.
The map and reduce scripts in Python
Test the programs and run the job in the same way as in Ruby.
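Because Streaming delivers map output to the reducer sorted by key, a Python reducer only has to watch for the key to change (a sketch; the names are my own):

```python
import sys

def run_reducer(stdin=sys.stdin, stdout=sys.stdout):
    """Read key-sorted tab-delimited lines; emit the maximum value per key."""
    last_key, max_val = None, None
    for line in stdin:
        key, val = line.rstrip("\n").split("\t")
        if key != last_key:
            if last_key is not None:
                stdout.write("%s\t%s\n" % (last_key, max_val))  # key changed
            last_key, max_val = key, int(val)
        else:
            max_val = max(max_val, int(val))
    if last_key is not None:
        stdout.write("%s\t%s\n" % (last_key, max_val))          # flush last key
```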
26
Hadoop Pipes
Hadoop Pipes
The name of the C++ interface to Hadoop MapReduce.
Pipes uses sockets as the channel over which the tasktracker communicates with the
process running the C++ map or reduce function.
The source code for the map and reduce functions in C++
27
Hadoop Pipes
The source code for the map and reduce functions in C++
28
Hadoop Pipes
Compiling and Running
Makefile for C++ MapReduce program
Define PLATFORM, which specifies the operating system, architecture, and data
model (e.g., 32- or 64-bit).
To run a Pipes job, we need to run Hadoop (its daemons) in pseudo-distributed mode.
Next step is to copy the executable code (program) to HDFS.
Next, the sample data is copied from the local filesystem to HDFS.
29
Hadoop Pipes
Compiling and Running
Now, we can run the job. For this, we use the hadoop pipes command, passing the URI
of the executable in HDFS using the -program argument:
30