Transcript ppt

Large-Scale Data Processing / Cloud Computing
Lecture 2 – MapReduce System
Bo Peng (彭波)
School of Electronics Engineering and Computer Science, Peking University
4/22/2011
http://net.pku.edu.cn/~course/cs402/
Jimmy Lin
University of Maryland
Course development: SEWMGroup
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Outline
• "The Google File System," in SOSP 2003. Bolton Landing, NY, USA: ACM Press, 2003.
• "MapReduce: Simplified Data Processing on Large Clusters," in OSDI 2004.
MapReduce Basics
Typical Large-Data Problem
• Iterate over a large number of records
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output
Key idea: provide a functional abstraction for these two operations
(Dean and Ghemawat, OSDI 2004)
MapReduce
• Programmers specify two functions:
  map (k, v) → <k', v'>*
  reduce (k', v') → <k'', v''>*
  - All values with the same key are sent to the same reducer
• The execution framework handles everything else…
[Figure: map tasks transform input pairs (k1,v1) … (k6,v6) into intermediate pairs with keys a, b, and c. Shuffle and sort aggregates values by key: a → (1, 5), b → (2, 7), c → (2, 3, 6, 8). Three reduce tasks then emit the final pairs (r1,s1), (r2,s2), (r3,s3).]
MapReduce
• Programmers specify two functions:
  map (k, v) → <k', v'>*
  reduce (k', v') → <k'', v''>*
  - All values with the same key are sent to the same reducer
• The execution framework handles everything else…
What's "everything else"?
MapReduce "Runtime"
• Handles scheduling
  - Assigns workers to map and reduce tasks
• Handles "data distribution"
  - Moves processes to data
• Handles synchronization
  - Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
  - Detects worker failures and restarts
• Everything happens on top of a distributed FS (later)
MapReduce
• Programmers specify two functions:
  map (k, v) → <k', v'>*
  reduce (k', v') → <k'', v''>*
  - All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite… usually, programmers also specify (a Java sketch of both follows below):
  partition (k', number of partitions) → partition for k'
  - Often a simple hash of the key, e.g., hash(k') mod n
  - Divides up key space for parallel reduce operations
  combine (k', v') → <k', v'>*
  - Mini-reducers that run in memory after the map phase
  - Used as an optimization to reduce network traffic
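To make partition and combine concrete, here is a minimal sketch against the Hadoop Java API (org.apache.hadoop.mapreduce), assuming word-count-style Text keys and IntWritable counts; the class names are illustrative and not part of the original slides.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountExtras {

  // Combiner: a "mini-reducer" that pre-aggregates counts on the map side,
  // cutting down the intermediate data shuffled over the network.
  public static class SumCombiner
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Partitioner: assigns each intermediate key to one of n reduce tasks,
  // using the hash(k') mod n scheme from the slide.
  public static class WordPartitioner
      extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }
}

Because word-count reduction is associative and commutative, the reducer itself can double as the combiner; the partitioner above mirrors the behavior of Hadoop's default HashPartitioner.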
[Figure: the same data flow, but each map task's output first passes through a combiner, which locally pre-aggregates values sharing a key, and a partitioner, which assigns each key to one of the reduce tasks; shuffle and sort then aggregates the partially combined values by key before the reducers emit (r1,s1), (r2,s2), (r3,s3).]
Two more details…
• Barrier between map and reduce phases
  - But we can begin copying intermediate data earlier
• Keys arrive at each reducer in sorted order
  - No enforced ordering across reducers
"Hello World": Word Count

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);
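For reference, the same word count sketched against the Hadoop Java API (org.apache.hadoop.mapreduce); the pseudocode above is language-neutral, so the class names here are illustrative, and the input is assumed to be default TextInputFormat records of (byte offset, line of text).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: for each word in the input line, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word and emit (word, sum).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}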
MapReduce can refer to…
• The programming model
• The execution framework (aka "runtime")
• The specific implementation
Usage is usually clear from context!
MapReduce Implementations
• Google has a proprietary implementation in C++
  - Bindings in Java, Python
• Hadoop is an open-source implementation in Java
  - An Apache project
  - Development led largely by Yahoo; used in production
  - Rapidly expanding software ecosystem
• Lots of custom research implementations
  - For GPUs, cell processors, etc.
[Figure: MapReduce execution overview. The user program (1) submits the job to the master, which (2) schedules map and reduce tasks onto workers. Map workers (3) read their input splits (split 0 … split 4) and (4) write intermediate files to local disk; reduce workers (5) remotely read the intermediate data and (6) write the output files (output file 0, output file 1). Phases: input files → map phase → intermediate files (on local disk) → reduce phase → output files.]
Adapted from (Dean and Ghemawat, OSDI 2004)
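In Hadoop terms, the "User Program / (1) submit" step corresponds to a small driver that configures and submits a Job; a minimal sketch, reusing the illustrative WordCount classes from the earlier sketch (Job.getInstance(conf, ...) replaces the constructor shown here in newer Hadoop releases).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Job.getInstance(conf, ...) in newer Hadoop
    job.setJarByClass(WordCountDriver.class);

    // Mapper/reducer from the earlier word-count sketch; the reducer doubles as combiner.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(3);                // number of reduce partitions

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit to the master and wait; the framework handles scheduling,
    // the shuffle, and fault recovery as in the figure.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}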
How do we get data to the workers?
[Figure: compute nodes pulling data over the network from shared NAS/SAN storage.]
What's the problem here?
Distributed File System
• Don't move data to workers… move workers to the data!
  - Store data on the local disks of nodes in the cluster
  - Start up the workers on the node that has the data local
• Why?
  - Not enough RAM to hold all the data in memory
  - Disk access is slow, but disk throughput is reasonable
• A distributed file system is the answer
  - GFS (Google File System) for Google's MapReduce
  - HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
• Commodity hardware over "exotic" hardware
  - Scale "out", not "up"
• High component failure rates
  - Inexpensive commodity components fail all the time
• "Modest" number of huge files
  - Multi-gigabyte files are common, if not encouraged
• Files are write-once, mostly appended to
  - Perhaps concurrently
• Large streaming reads over random access
  - High sustained throughput over low latency
GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions
• Files stored as chunks
  - Fixed size (64MB)
• Reliability through replication
  - Each chunk replicated across 3+ chunkservers
• Simple centralized management
  - Single master to coordinate access, keep metadata
• No data caching
  - Little benefit due to large datasets, streaming reads
• Simplify the API
  - Push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas)
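These decisions surface directly as client-visible settings in HDFS; a minimal write sketch, assuming the Hadoop-0.20-era property names dfs.block.size and dfs.replication (renamed in later releases) and a purely illustrative path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 64 MB blocks and 3 replicas mirror the GFS-style defaults above.
    conf.setLong("dfs.block.size", 64L * 1024 * 1024);
    conf.setInt("dfs.replication", 3);

    FileSystem fs = FileSystem.get(conf);
    // Streaming, append-style write; the path is illustrative.
    FSDataOutputStream out = fs.create(new Path("/user/demo/large-file"));
    out.writeBytes("write-once, append-mostly data\n");
    out.close();
    fs.close();
  }
}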
From GFS to HDFS
• Terminology differences:
  - GFS master = Hadoop namenode
  - GFS chunkservers = Hadoop datanodes
• Functional differences:
  - No file appends in HDFS (planned feature)
  - HDFS performance is (likely) slower
• For the most part, we'll use the Hadoop terminology…
HDFS Architecture
[Figure: an application issues (file name, block id) requests through the HDFS client to the HDFS namenode, which maintains the file namespace (e.g., /foo/bar → block 3df2) and replies with (block id, block location). The client then sends (block id, byte range) requests to HDFS datanodes and receives block data. Datanodes store blocks on their local Linux file systems, report their state to the namenode, and receive instructions from it.]
Adapted from (Ghemawat et al., SOSP 2003)
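From the application side, the (file name, block id) / (block id, block location) exchange in the figure is hidden behind Hadoop's FileSystem API; a minimal read sketch for the /foo/bar path shown in the figure.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);          // client handle; metadata via the namenode

    // open() asks the namenode for block locations; subsequent reads stream
    // block data directly from the datanodes.
    FSDataInputStream in = fs.open(new Path("/foo/bar"));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}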
Namenode Responsibilities
• Managing the file system namespace:
  - Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
• Coordinating file operations:
  - Directs clients to datanodes for reads and writes
  - No data is moved through the namenode
• Maintaining overall health:
  - Periodic communication with the datanodes
  - Block re-replication and rebalancing
  - Garbage collection
Putting everything together…
[Figure: a complete Hadoop cluster. The namenode runs the namenode daemon and the job submission node runs the jobtracker; each slave node runs a tasktracker and a datanode daemon on top of its local Linux file system.]
References
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google File System." In Proceedings of SOSP 2003. Bolton Landing, NY, USA: ACM Press, 2003.
• Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." In Proceedings of OSDI 2004.
Q&A?
Thank you!