Partitioners and Combiners


MAP REDUCE BASICS
CHAPTER 2
Basics
• Divide and conquer
– Partition large problem into smaller subproblems
– Workers work on subproblems in parallel
• Threads in a core, cores in a multi-core processor,
multiple processors in a machine, machines in a cluster
– Combine intermediate results from workers into the final result
– Issues
• How to break the problem into smaller tasks
• How to assign tasks to workers
• How workers get the data they need
• How to coordinate synchronization among workers
• How to share partial results
• How to do all of this in the presence of software errors and hardware faults
Basics
• MR – abstraction that hides system-level
details from programmer
• Move code to data
– Spread data across disks
– DFS manages storage
Topics
• Functional programming
• MapReduce
• Distributed file system
Functional Programming Roots
• MapReduce = functional programming plus distributed
processing on steroids
– Not a new idea… dates back to the 50’s (or even 30’s)
• What is functional programming?
– Computation as application of functions
– Computation is evaluation of mathematical functions
– Avoids state and mutable data
– Emphasizes application of functions instead of changes in
state
Functional Programming Roots
• How is it different?
– Traditional notions of “data” and “instructions” are not
applicable
– Data flows are implicit in program
– Different orders of execution are possible
– Theoretical foundation provided by lambda calculus
• a formal system for function definition
• Exemplified by LISP, Scheme
Overview of Lisp
• Functions written in prefix notation
(+ 1 2) → 3
(* 3 4) → 12
(sqrt (+ (* 3 3) (* 4 4))) → 5
(define x 3) → x
(* x 5) → 15
Functions
• Functions = lambda expressions bound to variables
• Example: (+ 1 2) → 3 is the application of the lambda expression λx.λy.x+y to the arguments 1 and 2
• A lambda expression bound to a variable:
(define foo
  (lambda (x y)
    (sqrt (+ (* x x) (* y y)))))
• Above expression is equivalent to:
(define (foo x y)
  (sqrt (+ (* x x) (* y y))))
• Once defined, function can be applied:
(foo 3 4) → 5
Functional Programming Roots
• Two important concepts in functional
programming
– Map: do something to everything in a list
– Fold: combine results of a list in some way
Functional Programming Map
• Higher order functions – accept other functions as arguments
– Map
• Takes a function f and a list as arguments
• Applies f to all elements in the list
• Returns a list as result
• Lists are primitive data types
– [1 2 3 4 5]
– [[a 1] [b 2] [c 3]]
Map/Fold in Action
• Simple map example:
(map (lambda (x) (* x x)) [1 2 3 4 5]) → [1 4 9 16 25]
Functional Programming Reduce
– Fold
• Takes function g, which has 2 arguments: an initial
value and a list.
• g is applied to the initial value and the 1st item in the list
• Result stored in an intermediate variable
• The intermediate variable and the next item in the list go into the
2nd application of g, etc.
• Fold returns final value of intermediate variable
Map/Fold in Action
• Simple map example:
(map (lambda (x) (* x x)) [1 2 3 4 5]) → [1 4 9 16 25]
• Fold examples:
(fold + 0 [1 2 3 4 5]) → 15
(fold * 1 [1 2 3 4 5]) → 120
• Sum of squares:
(define (sum-of-squares v)   ; where v is a list
  (fold + 0 (map (lambda (x) (* x x)) v)))
(sum-of-squares [1 2 3 4 5]) → 55
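The same map/fold composition carries over to other languages; below is a minimal sketch in Java using the Streams API (the class name and the use of streams here are illustrative, not from the slides):

import java.util.List;

public class SumOfSquares {
    public static void main(String[] args) {
        List<Integer> v = List.of(1, 2, 3, 4, 5);

        // map: square every element; reduce (fold): sum, starting from the initial value 0
        int sumOfSquares = v.stream()
                            .map(x -> x * x)
                            .reduce(0, Integer::sum);

        System.out.println(sumOfSquares);  // 55
    }
}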
Functional Programming Roots
• Use map/fold in combination
• Map – transformation of a dataset
• Fold – aggregation operation
• Can apply map in parallel
• Fold – more restrictions, elements must be brought together
– In many applications g does not need to be applied sequentially to
all elements of the list, so fold aggregations can also be done in parallel
Functional Programming Roots
• Map in MapReduce is same as in functional
programming
• Reduce corresponds to fold
• 2 stages:
– User specified computation applied over all input,
can occur in parallel, return intermediate output
– Output aggregated by another user-specified
computation
Mappers/Reducers
• Key-value pair (k,v) – basic data structure in
MR
• Keys, values – int, strings, etc., user defined
– e.g. keys – URLs, values – HTML content
– e.g. keys – node ids, values – adjacency lists of
nodes
Map: (k1, v1) -> [(k2, v2)]
Reduce: (k2, [v2]) -> [(k3, v3)]
Where […] denotes a list
General Flow
• Apply mapper to every input key-value pair stored in
DFS
• Generate arbitrary number of intermediate (k,v)
• Distributed group by operation (shuffle) on intermediate
keys
• Sort intermediate results by key (not across reducers)
• Aggregate intermediate results
• Generate final output to DFS – one file per reducer
What function is implemented?
Example: unigram (word count)
• (docid, doc) on DFS, doc is text
• Mapper tokenizes (docid, doc), emits (k,v) for
every word – (word, 1)
• Execution framework brings all pairs with the same key together at
one reducer
• Reducer – sums all counts (of 1) for each word
• Each reducer writes to one file
• Words within a file are sorted; each file has roughly the same number of words
• Can use output as input to another MR
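As a concrete sketch of the word-count mapper and reducer described above, using the standard Hadoop MapReduce API (class and field names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: (docid, doc text) -> (word, 1) for every token in the document
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit (word, 1)
      }
    }
  }

  // Reducer: (word, [1, 1, ...]) -> (word, total count)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();                  // sum all the 1s for this word
      }
      result.set(sum);
      context.write(key, result);          // one line per word in this reducer's output file
    }
  }
}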
Combine - Bandwidth Optimization
• Issue: large number of key-value pairs
– Example – word count (word, 1)
– If copied across the network, intermediate data can exceed the input in size
• Solution: use Combiner functions
– Allow local aggregation (after the mapper) before the shuffle/sort
• Word count – aggregate (count each word locally)
– Intermediate pairs per mapper = # of unique words it saw
– Executed on the same machine as the mapper – sees no output from
other mappers
– Results in a “mini-reduce” right after the map phase
– Combiner (k,v) types must match the mapper output / reducer input types
– If the operation is associative and commutative, the reducer can be
used as the combiner
• Reduces the number of key-value pairs to save bandwidth
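Since integer addition is associative and commutative, the word-count reducer sketched earlier can double as the combiner; a minimal sketch of the registration (assumes the WordCount classes from the earlier sketch):

import org.apache.hadoop.mapreduce.Job;

public class EnableCombiner {
  // Registers the reducer as a combiner: each map task then runs a
  // "mini-reduce" on its own output, collapsing repeated (word, 1)
  // pairs into (word, local count) before the shuffle.
  static void configure(Job job) {
    job.setCombinerClass(WordCount.IntSumReducer.class);
  }
}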
Partitioners – Load Balance
• Issue: intermediate results could all end up at one reducer (load imbalance)
• Solution: use Partitioner functions
– Divide up the intermediate key space and assign (k,v) pairs to
reducers
– Specifies the reduce task to which each intermediate (k,v) pair is copied
– Each reducer processes its keys in sorted order
– Default partitioner computes a hash value of the key and takes it modulo
the # of reducers
• Hopefully this sends about the same number of keys to each reducer
• But key frequencies may be Zipfian (skewed), so load can still be uneven
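A sketch of such a partitioner, mirroring the hash-then-mod behavior of Hadoop's default HashPartitioner (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each intermediate key to a reduce task by hashing the key
// and taking the hash modulo the number of reducers.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask the sign bit so the modulo result is non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}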
MapReduce
• Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
– All v’ with the same k’ are reduced together
• Usually, programmers also specify:
partition (k’, number of partitions ) → partition for k’
– Often a simple hash of the key, e.g. hash(k’) mod n
– Allows reduce operations for different keys in parallel
• Implementations:
– Google has a proprietary implementation in C++
– Hadoop is an open source implementation in Java (led by
Yahoo)
It’s not just Map and Reduce
• Apply mapper to every input key-value pair stored in
DFS
• Generate arbitrary number of intermediate (k,v)
• Aggregate locally
• Assign to reducers
• Distributed group by operation (shuffle) on intermediate
keys
• Sort intermediate results by key (not across reducers)
• Aggregate intermediate results
• Generate final output to DFS – one file per reducer
Execution Framework
• MapReduce program (job) contains
– Code for mappers
– Combiners
– Partitioners
– Code for reducers
– Configuration parameters (where the input is, where to store the output)
• Execution framework takes care of everything else
• Developer submits job to submission node of
cluster (jobtracker)
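A minimal sketch of such a job submission in Hadoop, wiring together the mapper, combiner, partitioner, and reducer sketched earlier (input/output paths and the job name are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");

    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);   // code for mappers
    job.setCombinerClass(WordCount.IntSumReducer.class);   // combiner
    job.setPartitionerClass(WordPartitioner.class);        // partitioner
    job.setReducerClass(WordCount.IntSumReducer.class);    // code for reducers

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Configuration parameters: where the input lives, where to store the output
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit to the cluster; the framework takes care of everything else
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}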
Recall these problems?
• How do we assign work units to workers?
• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?
Execution Framework
• Scheduling
– Job divided into tasks (each covers a block of (k,v) pairs)
– Can have 1000s of tasks that need to be assigned
– May exceed the number that can run concurrently
– Task queue
– Coordination among tasks from different jobs
Execution Framework
• Speculative execution
• Map phase only as fast as?
– slowest map task
• Problem: Stragglers, flaky hardware
• Solution: Use speculative execution:
– Exact copy of the same task started on a different machine
– Uses the result of whichever attempt finishes first
– Better for map or reduce?
– Can improve running time by 44% (Google)
– Doesn’t help if the skew is in the distribution of values
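Speculative execution is a framework feature rather than programmer code; in Hadoop it can be toggled per job through configuration properties. A minimal sketch (the property names assume Hadoop 2.x-style configuration and are not from the slides):

import org.apache.hadoop.conf.Configuration;

public class SpeculationConfig {
  static Configuration withSpeculation() {
    Configuration conf = new Configuration();
    // Allow backup attempts for straggling map tasks
    conf.setBoolean("mapreduce.map.speculative", true);
    // Often disabled for reducers whose output has external side effects
    conf.setBoolean("mapreduce.reduce.speculative", false);
    return conf;
  }
}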
Execution Framework
• Data/code co-location
– Execute near data
– If not possible, must stream data across the network
• Try to keep within same rack
Execution Framework
• Synchronization
– Concurrently running processes join up
– Intermediate (k,v) grouped by key,
copy intermediate data over network, shuffle/sort
• Number of copy operations? Worst case:
– M × R copy operations (e.g., 1,000 mappers and 50 reducers give up to 50,000 copies)
• Each mapper may send intermediate results to every reducer
– Reduce computation cannot start until all mappers
finished, (k,v) shuffled/sorted
• Differs from functional programming
– Can copy intermediate (k,v) over network to reducer
when mapper finishes
Execution Framework
• Error/fault handling
– The norm
– Disk failures, RAM errors, datacenter outages
– Software errors
– Corrupted data
Differences in MapReduce
Implementations
• Hadoop (Apache) vs. Google
– Hadoop - Values arbitrarily ordered, can change
key in reducer
– Google – program can specify a secondary sort, can’t
change the key in the reducer
• Hadoop
– Programmer can specify number of map tasks, but
framework makes final decision
– For reduce, the programmer-specified number of tasks
is used
Hadoop
• Be careful using external resources (e.g., querying a SQL
DB can become a bottleneck)
• Mappers can emit an arbitrary number of intermediate (k,v) pairs, which can
be of a different type than the input
• Reducers can emit an arbitrary number of final (k,v) pairs, which can be of a
different type than the intermediate (k,v)
• Different from functional programming – can have side effects
(internal state changes may cause problems; external side effects may
write to files)
• MapReduce can have no reduce phase, but must have a mapper
– Can just pass the identity function as the reducer
– May not even need any input (e.g., computing pi)
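A map-only job is requested by setting the number of reduce tasks to zero; a minimal sketch using the standard Job API:

import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJob {
  // With zero reduce tasks there is no shuffle/sort; each mapper's
  // output is written directly to the DFS, one file per map task.
  static void makeMapOnly(Job job) {
    job.setNumReduceTasks(0);
  }
}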
Other Sources
• Other stores can serve as a source/destination
for MapReduce data
– Google – BigTable
– HBase – BigTable clone
– Hadoop – integrates with relational DBs for parallel processing,
can write to DB tables
Distributed File System (DFS)
• In HPC, storage distinct from computation
• NAS (network attached storage) and SAN are
common
– Separate, dedicated nodes for storage
• Fetch, load, process, write
• Bottleneck
– Higher performance networks $$ (10G Ethernet),
special purpose interconnects $$$ (InfiniBand)
• $$ increases non-linearly
– In GFS, computation and storage are not distinct
components
Hadoop Distributed File System - HDFS
• GFS supports proprietary MapReduce
• HDFS – supports Hadoop
• MapReduce doesn’t have to run on GFS/HDFS, but misses
their advantages otherwise
• Differences of GFS and HDFS vs. a general-purpose DFS:
– Adapted to large data processing
– Divide user data into chunks/blocks – LARGE
– Replicate these across the local disks of nodes in the
cluster
– Master-slave architecture
HDFS vs GFS (Google File System)
• Difference in HDFS:
– Master-slave architecture
• GFS: Master (master), slave (chunkserver)
• HDFS: master (namenode), slave (datanode)
– Master – namespace (metadata, directory structure, file to
block mapping, location of blocks, access permission)
– Slaves – manage actual data blocks
– Client contacts the namenode for metadata, then gets data directly from the
slaves; 3 copies of each block, etc.
– Blocks are 64 MB
– Initially, files were immutable – once closed they cannot be
modified
HDFS
• Namenode
– Namespace management
– Coordinate file operations
• Lazy garbage collection
– Maintain file system health
• Heartbeats, under-replication, balancing
• Supports subset of POSIX API, pushed to
application
• No Security
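From a client's point of view the namenode/datanode split is hidden behind the FileSystem API; a minimal sketch of reading an HDFS file (the path is a placeholder):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The client asks the namenode for block locations (metadata only),
    // then streams the blocks directly from the datanodes.
    try (InputStream in = fs.open(new Path("/example/input.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}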
Hadoop Cluster Architecture
• HDFS namenode runs daemon
• Job submission node runs jobtracker
– point of contact for running MapReduce jobs
– Monitors progress of MapReduce jobs,
coordinates Mappers and reducers
• Slaves run tasktracker
– Runs the user’s code and the datanode daemon, serves HDFS
data
– Send heartbeat messages to jobtracker
Hadoop Cluster Architecture
• Number of reduce tasks depends on reducers
specified by programmer
• Number of map tasks depends on
– Hint from programmer
– Number of input files
– Number of HDFS data blocks of files
Hadoop Cluster Architecture
• Map tasks assigned
– A chunk of (k,v) pairs called an input split
• Input splits are computed automatically
• Aligned on HDFS block boundaries so each split is associated with a single
block; simplifies scheduling
• Data locality; if not possible, stream data across the network (same rack if
possible)
• How can we use MapReduce to solve
problems?
Hadoop Cluster Architecture
• Mappers in Hadoop
– Java objects with a MAP method
– A mapper object is instantiated for every map task by the
tasktracker
– Life cycle – instantiation, then an API hook for program-specified
initialization code
• Mappers can load state, static data sources, dictionaries, etc.
– After initialization: MAP method called by framework
on all (k,v) in input split
– Method calls within same Java object, can preserve
state across multiple (k,v) in same task
– Can run programmer specified termination code
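A sketch of this life cycle for word count, using the setup/map/cleanup hooks of the Hadoop Mapper API to preserve state across all (k,v) pairs of one input split (an "in-mapper combining" style sketch; class and field names are illustrative):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StatefulWordCountMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private Map<String, Integer> counts;    // state preserved across map() calls

  @Override
  protected void setup(Context context) {
    counts = new HashMap<>();             // initialization hook: load state, dictionaries, etc.
  }

  @Override
  protected void map(Object key, Text value, Context context) {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      // Aggregate locally instead of emitting (word, 1) for every token
      counts.merge(itr.nextToken(), 1, Integer::sum);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Termination hook: emit one (word, local count) pair per unique word in the split
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}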
Hadoop Cluster Architecture
• Reducers in Hadoop
– Execution similar to that of mappers
• Instantiation, initialization, then the framework calls the REDUCE
method with an intermediate key and an iterator over all values
for that key
• Intermediate keys in sorted order
• Can preserve state across multiple intermediate keys
CAP Theorem
• Consistency, availability, partition tolerance
• Cannot satisfy all 3
• Partitioning unavoidable in large data systems,
must trade off availability and consistency
– If the master fails, the system is unavailable but consistent!
– If multiple masters, more available, but inconsistent
• Workaround to single namenode
– Warm standby namenode
– Hadoop community working on it