Large-Scale Data Processing / Cloud Computing
Lecture 2 – MapReduce System
Bo Peng (彭波)
School of Electronics Engineering and Computer Science, Peking University
4/22/2011
http://net.pku.edu.cn/~course/cs402/
Jimmy Lin
University of Maryland
Course development: SEWM Group
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Outline
• Ghemawat, Gobioff, and Leung, "The Google File System," in SOSP '03. Bolton Landing, NY, USA: ACM Press, 2003.
• Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI '04, 2004.
MapReduce Basics
Typical Large-Data Problem
Iterate over a large number of records
Extract something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output
Key idea: provide a functional abstraction for two of these operations —
extracting something from each record (Map) and aggregating intermediate results (Reduce)
(Dean and Ghemawat, OSDI 2004)
MapReduce
Programmers specify two functions:
map (k, v) → <k', v'>*
reduce (k', v') → <k'', v''>*
All values with the same key are sent to the same reducer
The execution framework handles everything else…
[Figure: mappers process input pairs (k1 v1 … k6 v6) and emit intermediate pairs — a 1; b 2, c 3; c 6, a 5; c 2, b 7, c 8. Shuffle and sort aggregates values by key (a → 1 5; b → 2 7; c → 2 3 6 8), and three reducers produce the final outputs r1 s1, r2 s2, r3 s3.]
MapReduce
Programmers specify two functions:
map (k, v) → <k', v'>*
reduce (k', v') → <k'', v''>*
All values with the same key are sent to the same reducer
The execution framework handles everything else…
What’s “everything else”?
MapReduce "Runtime"
Handles scheduling
  Assigns workers to map and reduce tasks
Handles "data distribution"
  Moves processes to data
Handles synchronization
  Gathers, sorts, and shuffles intermediate data
Handles errors and faults
  Detects worker failures and restarts
Everything happens on top of a distributed FS (later)
MapReduce
Programmers specify two functions:
map (k, v) → <k', v'>*
reduce (k', v') → <k'', v''>*
All values with the same key are reduced together
The execution framework handles everything else…
Not quite…usually, programmers also specify:
partition (k’, number of partitions) → partition for k’
Often a simple hash of the key, e.g., hash(k’) mod n
Divides up key space for parallel reduce operations
combine (k’, v’) → <k’, v’>*
Mini-reducers that run in memory after the map phase
Used as an optimization to reduce network traffic
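These two extension points can be sketched in a few lines of Python (a hypothetical sketch, not Hadoop's actual API; the sample map output reuses values from the running example):

```python
from collections import defaultdict

def partition(key, num_partitions):
    # The simple default from the slide: hash of the key, mod n.
    # (Python's built-in string hash varies per run; a production
    # partitioner would use a stable hash function.)
    return hash(key) % num_partitions

def combine(pairs):
    # Mini-reducer: pre-aggregate one mapper's output in memory,
    # so less intermediate data crosses the network.
    sums = defaultdict(int)
    for k, v in pairs:
        sums[k] += v
    return sorted(sums.items())

map_output = [("c", 2), ("b", 7), ("c", 8)]  # one mapper's raw output
combined = combine(map_output)               # [('b', 7), ('c', 10)]
buckets = {k: partition(k, 3) for k, _ in combined}
print(combined, buckets)
```

Note that the combiner only shrinks one mapper's local output; the reducer still performs the final aggregation across all mappers.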
[Figure: the same pipeline extended with combine and partition steps — each mapper's output first passes through a combiner (e.g., two c values merging locally into c 9) and then a partitioner, after which shuffle and sort aggregates values by key and delivers them to the three reducers, which produce r1 s1, r2 s2, r3 s3.]
Two more details…
Barrier between map and reduce phases
But we can begin copying intermediate data earlier
Keys arrive at each reducer in sorted order
No enforced ordering across reducers
“Hello World”: Word Count
Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);
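The pseudocode above can be made runnable in plain Python, with the framework's shuffle-and-sort step simulated explicitly (a minimal sketch; `map_fn`, `shuffle`, and `reduce_fn` are illustrative names, not a real framework API):

```python
from collections import defaultdict

def map_fn(docid, text):
    # Emit (word, 1) for each word, as in Map() above.
    for word in text.split():
        yield (word, 1)

def shuffle(pairs):
    # What the framework does between the phases: group values by key
    # and deliver keys to each reducer in sorted order.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

def reduce_fn(term, values):
    # Sum the counts for one term, as in Reduce() above.
    return (term, sum(values))

docs = {"d1": "a b b", "d2": "b c"}
intermediate = [kv for docid, text in docs.items() for kv in map_fn(docid, text)]
result = [reduce_fn(k, vs) for k, vs in shuffle(intermediate)]
print(result)  # [('a', 1), ('b', 3), ('c', 1)]
```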
MapReduce can refer to…
The programming model
The execution framework (aka “runtime”)
The specific implementation
Usage is usually clear from context!
MapReduce Implementations
Google has a proprietary implementation in C++
  Bindings in Java, Python
Hadoop is an open-source implementation in Java
  An Apache project
  Development led by Yahoo, used in production
  Rapidly expanding software ecosystem
Lots of custom research implementations
  For GPUs, cell processors, etc.
[Figure: MapReduce execution overview — (1) the user program submits a job to the master; (2) the master schedules map and reduce tasks onto workers; (3) map workers read input splits; (4) map output is written to local disk; (5) reduce workers remotely read the intermediate files; (6) reduce workers write the output files. Input files → Map phase → intermediate files (on local disk) → Reduce phase → output files.]
Adapted from (Dean and Ghemawat, OSDI 2004)
How do we get data to the workers?
[Figure: compute nodes pulling data over the network from shared NAS/SAN storage.]
What's the problem here?
Distributed File System
Don't move data to workers… move workers to the data!
  Store data on the local disks of nodes in the cluster
  Start up the workers on the node that has the data local
Why?
  Not enough RAM to hold all the data in memory
  Disk access is slow, but disk throughput is reasonable
A distributed file system is the answer
  GFS (Google File System) for Google's MapReduce
  HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
Commodity hardware over "exotic" hardware
  Scale "out", not "up"
High component failure rates
  Inexpensive commodity components fail all the time
"Modest" number of huge files
  Multi-gigabyte files are common, if not encouraged
Files are write-once, mostly appended to
  Perhaps concurrently
Large streaming reads over random access
  High sustained throughput over low latency
GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions
Files stored as chunks
  Fixed size (64MB)
Reliability through replication
  Each chunk replicated across 3+ chunkservers
Simple centralized management
  Single master to coordinate access, keep metadata
No data caching
  Little benefit due to large datasets, streaming reads
Simplify the API
  Push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas)
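The first two decisions — fixed-size chunking and 3-way replication — can be illustrated with a toy sketch (the round-robin placement policy here is invented for illustration; the real master uses smarter placement that considers rack locality and disk utilization):

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB fixed-size chunks, as in GFS
REPLICAS = 3              # each chunk stored on 3 chunkservers

def split_into_chunks(file_size):
    # Number of fixed-size chunks needed for a file of file_size bytes.
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_id, servers):
    # Simplistic round-robin replica placement across chunkservers.
    n = len(servers)
    return [servers[(chunk_id + i) % n] for i in range(REPLICAS)]

servers = ["cs0", "cs1", "cs2", "cs3", "cs4"]
file_size = 5 * 2**30                  # a 5 GB file
print(split_into_chunks(file_size))    # 80 chunks
print(place_replicas(0, servers))      # ['cs0', 'cs1', 'cs2']
```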
From GFS to HDFS
Terminology differences:
GFS master = Hadoop namenode
GFS chunkservers = Hadoop datanodes
Functional differences:
No file appends in HDFS (planned feature)
HDFS performance is (likely) slower
For the most part, we’ll use the Hadoop terminology…
HDFS Architecture
[Figure: an application uses the HDFS client, which sends (file name, block id) requests to the HDFS namenode; the namenode holds the file namespace (e.g., /foo/bar → block 3df2) and replies with (block id, block location). The client then requests (block id, byte range) directly from HDFS datanodes, which return block data from their local Linux file systems. The namenode exchanges instructions and datanode state with the datanodes.]
Adapted from (Ghemawat et al., SOSP 2003)
Namenode Responsibilities
Managing the file system namespace:
  Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
Coordinating file operations:
  Directs clients to datanodes for reads and writes
  No data is moved through the namenode
Maintaining overall health:
  Periodic communication with the datanodes
  Block re-replication and rebalancing
  Garbage collection
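A toy model of the namenode's bookkeeping (hypothetical data structures, not HDFS internals): it maps file names to block ids and block ids to datanode locations, but never handles block data itself. The block id 3df2 is taken from the architecture diagram; the datanode names are invented.

```python
# file -> ordered list of block ids (the namespace)
namespace = {"/foo/bar": ["blk_3df2", "blk_7a10"]}

# block id -> datanode locations (3 replicas each)
block_locations = {
    "blk_3df2": ["datanode1", "datanode3", "datanode4"],
    "blk_7a10": ["datanode2", "datanode3", "datanode5"],
}

def locate(path):
    # The namenode only answers "where are the blocks?"; the client
    # then reads block data directly from the datanodes.
    return [(b, block_locations[b]) for b in namespace[path]]

print(locate("/foo/bar"))
```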
Putting everything together…
[Figure: a complete Hadoop cluster — the namenode runs the namenode daemon, the job submission node runs the jobtracker, and each slave node runs a tasktracker and a datanode daemon on top of its local Linux file system.]
References
Ghemawat, S., Gobioff, H., and Leung, S.-T., "The Google File System," in SOSP '03. Bolton Landing, NY, USA: ACM Press, 2003.
Dean, J., and Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters," in OSDI '04, 2004.
Q&A?
Thank you!