Transcript Slide 1

Main Memory Map Reduce (M3R)
VLDB, August 2012 (to appear)

Avi Shinnar, David Cunningham, Ben Herta, Vijay Saraswat
IBM Research

In collaboration with:
Yan Li*, Dave Grove, Mikio Takeuchi**, Salikh Zakirov**, Juemin Zhang

* IBM China Research Lab
** IBM Tokyo Research Lab
Otherwise IBM TJ Watson Research Lab, New York

M3R/Hadoop

• Hadoop
  – Popular Java API for Map/Reduce programming
  – Out of core, resilient, scalable (1000 nodes)
  – Based on HDFS (a resilient distributed filesystem)

• M3R/Hadoop
  – Reimplementation of the Hadoop API using managed X10
  – Existing Hadoop applications just work
  – Reuse HDFS (and some other parts of Hadoop)
  – In-memory: problem size must fit in cluster RAM
  – Not resilient: cluster scales until the MTBF barrier
  – But considerably faster (closer to HPC speeds)

System ML performance results

PageRank            Hadoop    M3R
  100K               924s     145s
  1M                1334s     232s
  3M                1581s     362s
  5M                1573s     460s

Linear Regression   Hadoop    M3R
  200K               925s     213s
  400K               952s     580s

GNNMF               Hadoop    M3R
  100K              1567s     162s
  200K              1585s     174s
  400K              1590s     201s

Iterative sparse matrix algorithms, implemented in DML (executed with System ML).
Run on 20 x86 nodes (each 8 cores, 16 GB RAM).

Sparse Matrix * Dense Vector performance

[Charts: Sparse Matrix Vector Multiplication time (s) vs. size M, for M up to 1,600,000, where G is an MxM matrix with sparsity 0.001. One panel compares Hadoop and M3R/Hadoop with exponential trend lines (time axis up to 2000s); the other shows M3R/Hadoop alone on a finer time axis (up to 45s).]

Results for our sparse MatVecMult code running on Hadoop and M3R/Hadoop.
The algorithm is specially tailored for M3R.
Approx. 50x speedup.

Architecture

[Diagram: Java Hadoop applications (multiple jobs) run either on the Hadoop Map Reduce engine (JVM only) or, through the M3R/Hadoop adaptor, on the M3R engine. X10 M3R jobs and Java M3R jobs target the M3R engine directly. Both engines access HDFS data via HDFS (JVM/Native). Components are split between X10 and Java.]

Speeding up Iterative Hadoop Map Reduce Jobs

• Reducing disk I/O
• Reducing network communication
• Reducing serialization/deserialization
  – E.g., about 25 seconds for a 1x1M sparse matrix

Presentation Note

In the pictures that follow, BLUE lines represent slow communication paths; BLACK lines represent fast in-memory aliasing.
The M3R goal is to turn BLUE lines to BLACK to get performance.

Basic Flow for an (Iterative) Hadoop Job
With an emphasis on Disk I/O

[Diagram: File System (HDFS) → Input (InputFormat/RecordReader/InputSplit) → Map (Mapper) → File System → Shuffle → File System → Reduce (Reducer) → Output (OutputFormat/RecordWriter/OutputCommitter) → File System (HDFS).]
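
For orientation, here is a minimal sketch of a driver for one iteration of such a job against the Hadoop mapred API. It only wires up the stages named in the diagram; the identity mapper/reducer, the sequence-file formats, and the key/value classes are illustrative choices, not details from the talk.

// Illustrative driver for one iteration of a Hadoop (mapred-API) job.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IterationDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IterationDriver.class);
    conf.setJobName("one-iteration");

    // Input/Output stages of the diagram: the InputFormat reads splits from HDFS,
    // the OutputFormat writes the result back to HDFS after the reduce.
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Map and Reduce stages (identity versions, just to show the wiring).
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputValueClass(Text.class);

    // In an iterative algorithm this job is resubmitted every iteration, which is
    // why the disk and shuffle costs in the diagram are paid again and again.
    JobClient.runJob(conf);
  }
}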
Basic Flow for an (Iterative) M3R/Hadoop Job
With an emphasis on Disk I/O

[Diagram: File System (HDFS) → Input (InputFormat/RecordReader/InputSplit) → Map (Mapper) → Shuffle → Reduce (Reducer) → Output (OutputFormat/RecordWriter/OutputCommitter) → File System (HDFS); no intermediate file system writes around the shuffle.]

Basic Flow for an (Iterative) M3R/Hadoop Job
With an emphasis on Disk I/O

[Diagram: as above, with a Cache interposed between the job and the File System (HDFS).]

Basic Flow for an (Iterative) Hadoop Job
With an emphasis on Network I/O

[Diagram: the Hadoop pipeline from the Disk I/O slide (Input → Map → File System → Shuffle → File System → Reduce → Output over HDFS), now with the network paths emphasized.]

Basic Flow for an (Iterative) Hadoop Job
With an emphasis on Network I/O

[Diagram repeated from the previous slide.]

Basic Flow for an (Iterative) M3R/Hadoop Job
With an emphasis on Network I/O

[Diagram: the cached M3R/Hadoop pipeline (Input → Map → Shuffle → Reduce → Output over HDFS, with Cache), now with the network paths emphasized.]

Basic Flow for an (Iterative) M3R/Hadoop Job
With an emphasis on Network I/O

[Diagram repeated from the previous slide.]

Basic Flow for an (Iterative) M3R/Hadoop Job
With an emphasis on Network I/O

[Diagram repeated from the previous slide.]

Map/Shuffle/Reduce

[Diagram: Map (Mapper) → Shuffle → Reduce (Reducer).]

Mappers/Shuffle/Reducers

[Diagram: Mapper1-Mapper6 feed the Shuffle, which feeds Reducer1-Reducer6.]

Co-locating Mappers and Reducers

[Diagram: the same Mapper1-Mapper6 / Shuffle / Reducer1-Reducer6 picture, with each mapper co-located with its corresponding reducer.]

Co-locating Mappers and Co-locating Reducers

[Diagram: Mapper1-Mapper6 / Shuffle / Reducer1-Reducer6, with the mappers co-located with each other and the reducers co-located with each other.]

Hadoop Broadcast

[Diagram: Mapper1-Mapper6 / Shuffle / Reducer1-Reducer6, with broadcast data flowing through the shuffle to every reducer.]

M3R Broadcast via De-Duplication

[Diagram: the same broadcast, with duplicate copies eliminated in the shuffle.]

M3R Broadcast via De-Duplication

[Diagram repeated from the previous slide.]

Iterated Matrix Vector multiplication

Algorithm (row-block partitioned G, V):
• Replicate V
  – In parallel, each place broadcasts its segment of V to all others
• In parallel, at each place, multiply each row of G with V (see the sketch below)
  – Yields a new distributed V

Key to performance:
• Read the appropriate part of G once (also subdivide horizontally for out-of-core)
• Never communicate G
• Communicate only to replicate V

[Diagram: G * V = V', with G row-block partitioned across places and V replicated at each place.]
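
To make the per-place work concrete, here is a small illustrative Java sketch (not from the talk; the data layout is an assumption): each place holds one row block of the sparse matrix G plus a full, replicated copy of V, and computes its segment of the new V locally, so G itself is never communicated.

// Illustrative per-place computation for the algorithm above.
public final class RowBlockMultiply {

  /** One row block of the sparse matrix G, stored as (row, col, value) triples,
      with row indices local to this block. */
  public static final class SparseBlock {
    final int[] rows, cols;
    final double[] vals;
    final int numLocalRows;
    SparseBlock(int[] rows, int[] cols, double[] vals, int numLocalRows) {
      this.rows = rows; this.cols = cols; this.vals = vals; this.numLocalRows = numLocalRows;
    }
  }

  /** Multiply this place's block of G by the replicated vector V; the result is
      this place's segment of the new distributed V. */
  static double[] multiplyLocalBlock(SparseBlock g, double[] replicatedV) {
    double[] newSegment = new double[g.numLocalRows];
    for (int k = 0; k < g.vals.length; k++) {
      newSegment[g.rows[k]] += g.vals[k] * replicatedV[g.cols[k]];
    }
    return newSegment;
  }
}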
Iterated Matrix Vector multiplication in Hadoop

[Diagram: two chained jobs over the File System (HDFS). Job 1: Map/Input (G) passes G, Map/Input (V) broadcasts V, Shuffle, Reducer (*) outputs V#. Job 2: Input (V#), Map passes V#, Shuffle, Reducer (+) outputs V'.]

Iterated Matrix Vector multiplication

Algorithm (row-block partitioned G, V):
• Replicate V
  – In parallel, each place broadcasts its segment of V to all others
• In parallel, at each place, multiply each row of G with V
  – Yields a new distributed V

Key to performance:
• Read the appropriate part of G once (also subdivide horizontally for out-of-core)
• Never communicate G
• Communicate only to replicate V

[Diagram: G * V = V', with G row-block partitioned across places and V replicated at each place.]

Iterated Matrix Vector multiplication in M3R

[Diagram: the same two-stage dataflow, now with a Cache over the File System (HDFS); the Reducer (*) stage no longer writes V# back to the file system.]

Iterated Matrix Vector multiplication in M3R

[Diagram: as above, with the shuffle stages annotated "Do not communicate G" and "Do no communication".]

Partition Stability in M3R

The reducer associated with a given partition number will always be run at the same place.
Same place => same memory: existing data structures can be reused.
The user can control local vs. remote communication.
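
As an illustration of how a Hadoop-API job might exploit this guarantee (a hedged sketch, not code from the talk; the per-partition cache below is hypothetical), a reducer can stash state in its JVM keyed by partition number. Under M3R that state is still there when the next job's reducer for the same partition runs.

// Hypothetical sketch: reusing per-partition state across jobs under M3R's
// partition-stability guarantee (same partition number => same place/JVM).
// Under stock Hadoop the static map below would simply start empty every time.
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class StatefulReducer extends MapReduceBase
    implements Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {

  // JVM-wide cache keyed by partition number; must be thread-safe because
  // M3R may run several reducers in one JVM.
  private static final Map<Integer, double[]> perPartitionState = new ConcurrentHashMap<>();

  private int partition;

  @Override
  public void configure(JobConf job) {
    // The framework sets this property to the task's partition number.
    partition = job.getInt("mapred.task.partition", 0);
  }

  public void reduce(IntWritable key, Iterator<DoubleWritable> values,
                     OutputCollector<IntWritable, DoubleWritable> out, Reporter reporter)
      throws IOException {
    double[] state = perPartitionState.computeIfAbsent(partition, p -> new double[1]);
    double sum = state[0];                // resume from the previous job's value
    while (values.hasNext()) {
      sum += values.next().get();
    }
    state[0] = sum;
    out.collect(key, new DoubleWritable(sum));
  }
}

This is one way the "existing data structures can be reused" point above can be cashed out in plain Hadoop-API code.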
An M3R/Hadoop Job that Exploits Locality

[Diagram: the cached M3R/Hadoop pipeline (Input → Map → Shuffle → Reduce → Output over HDFS, with Cache), now with a Partitioner attached to the Shuffle.]

Iterated Matrix Vector multiplication in M3R

[Diagram repeated from the earlier M3R slide, with the annotations "Do not communicate G" and "Do no communication".]

Conclusions

• Sacrifice resilience and out-of-core execution
• Gain performance
• Used X10 to build a fast map/reduce engine
• Used X10/Java interop to wrap it with the Hadoop API
• Used X10 features to implement a distributed cache
  – Avoids serialization, disk, and network I/O costs
• 10x faster for unmodified Hadoop apps (System ML)
• 50x faster for a Hadoop app designed for M3R

Backup Slides

Conclusions

• M3R is operational.
  – Multi-JVM, multi-threaded, main-memory map reduce implementation written in X10
  – Runs Java Hadoop 0.20.2 mapred jobs, which may use HDFS
  – Also supports an API that permits mappers/reducers to operate on objects pre-positioned in global memory
• Java Hadoop jobs can be written to take advantage of M3R features
  – Cloning: make keys/values immutable so they don't need to be cloned
  – Caching: the engine caches key-value pairs associated with files (avoiding (de)serialization and I/O costs)
  – Partition stability: key-value pairs with the same partition number go to the same JVM, across jobs
• M3R can run DML programs unchanged
  – Some changes were needed to the DML compiler to teach it to use the caching file system.
  – Speedups range from 1.6x to 13x
  – (Measurements on a 20-node cluster of 8-core machines; more tests are under way.)
• A Matrix Vector multiply implementation (in Java Hadoop 0.20.2) shows cycle-time improvements from 9x to 47x.
  – Exploits cloning, caching, and partition stability.
• Better performance is possible with code written to M3R APIs

M3R/Hadoop limitations

• The entire mapper output for a given job must fit in available memory.
  – It may be possible to relax this in the future.
• The mapper and reducer code must be safe for multi-threaded execution (see the sketch after this list).
  – In particular, the use of static variables is suspect.
  – Code that runs correctly using Hadoop's multi-threaded MapRunner is probably fine.
  – M3R will support single-threaded places (may already).
• Code should not assume the JVM will be restarted between tasks.
  – Clean up after yourself.
• For now, only supports mapred 0.20.2.
• No failure resilience.
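
A hypothetical illustration of the static-variable caveat (not code from the talk): a plain static counter would race when M3R runs several mapper instances in one JVM, so JVM-wide state must be thread-safe or avoided, exactly as under Hadoop's multi-threaded MapRunner.

// Hypothetical illustration of the static-variable caveat above.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LineCountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  // UNSAFE in M3R (and under a multi-threaded MapRunner): several mapper
  // instances in the same JVM would race on a plain static counter.
  //   private static long linesSeen = 0;

  // Thread-safe alternative if JVM-wide state is really wanted; a per-instance
  // (non-static) field would also be fine.
  private static final AtomicLong linesSeen = new AtomicLong();

  private static final LongWritable ONE = new LongWritable(1);
  private final Text outKey = new Text("lines");

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    linesSeen.incrementAndGet();          // shared across mappers in this JVM
    out.collect(outKey, ONE);
  }
}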
X10: An Evolution of Java for the Scale-Out Era

• Java
• How do you deal with petabytes of data?
• How do you take advantage of GPUs and FPGAs?
• X10 – Performance and Productivity at Scale

[Diagram: an X10 program spans many places connected by the APGAS runtime: managed places (X10 over a JVM, e.g. Place 0, Place 1, ...) and native places (compiled X10, e.g. Place 34, including GPU and FPGA targets).]

X10 and the APGAS model

• Five basic constructs
  – async S – run S as a separate activity
  – at (P) S – switch to place P to run S
  – finish S – execute S, wait for termination
  – when (c) S – execute S when c holds, atomically
  – clocked async, clocked finish support barriers
• Asynchrony: async S; Locality: at (P) S; Atomicity: when (c) S; Order: finish S, clocks
• Global data structures: points, regions, distributions, arrays
• Cilk-style work-stealing scheduler

class HelloWholeWorld {
  public static def main(s:Array[String](1)):void {
    finish
      for (p in Place.places())
        async
          at (p)
            Console.OUT.println("(At " + p + ") " + s(0));
  }
}

• Runs on modern interconnects
  – Collectives exploit hardware support
  – RDMA transfer support
• Runs on Blue Gene, x86, Power...
• Runs natively and in a JVM

Java-like productivity, MPI-like performance

M3R Goals

Fast multi-node, multi-threaded Map Reduce for clusters with high MTBF, optimized for iterative jobs.

• + Support the Hadoop mapred 0.20.2 API with minimal user-visible changes.
  – The same job should be able to run on Hadoop and M3R.
• Perform well on scale-up SMPs
• Ensure that DML programs can run unchanged on Hadoop or M3R.
• Jaql, Nimble, …
• + Support HDFS access

Reading G into the Correct Partitions: Preloading

[Diagram: a preloading job reads G from the File System (HDFS); the Map and Reduce stages simply pass G through, and the Shuffle/Partitioner routes it so that the output, Partitioned_G, sits in the Cache at the right partitions.]

Preload performance

[Chart: Sparse Matrix Vector multiplication runtime (s, up to ~3000) for M = 100k, 200k, 400k, 800k, 1000k, 1200k, 1500k, where G is an MxM matrix with sparsity 0.001; series: M3R/Hadoop, M3R/Hadoop+preload, Hadoop, Hadoop+preload.]

• Cost reflects the one-time preload cost plus the cost of 3 iterations.
• Preload costs decrease with the number of iterations.

Reading G into the Correct Partitions: PlacedSplit

class MatrixInputSplit implements InputSplit, PlacedSplit {
  int getPartition(…) {
  }
  …
}

M3R/Hadoop prioritizes PlacedSplit requests over HDFS locality.
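
As a slightly fuller sketch of the idea (hypothetical: PlacedSplit's exact package and method signature, and the row-block bookkeeping below, are assumptions rather than details from the talk), such a split carries enough information for M3R to route it to the partition that will also host the matching reducer.

// Hypothetical expansion of the slide's MatrixInputSplit; the shape of M3R's
// PlacedSplit interface is assumed here, e.g.:
//   interface PlacedSplit { int getPartition(int numPartitions); }
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.mapred.InputSplit;

public class MatrixInputSplit implements InputSplit, PlacedSplit {

  private int rowBlock;      // which row block of G this split covers
  private long lengthBytes;  // split size in bytes

  public MatrixInputSplit() { }                        // needed for deserialization

  public MatrixInputSplit(int rowBlock, long lengthBytes) {
    this.rowBlock = rowBlock;
    this.lengthBytes = lengthBytes;
  }

  // PlacedSplit: ask M3R to deliver this split to a specific partition, so the
  // block of G lands where its reducer will run.
  public int getPartition(int numPartitions) {
    return rowBlock % numPartitions;
  }

  // InputSplit
  public long getLength() throws IOException { return lengthBytes; }
  public String[] getLocations() throws IOException { return new String[0]; }

  // Writable
  public void write(DataOutput out) throws IOException {
    out.writeInt(rowBlock);
    out.writeLong(lengthBytes);
  }

  public void readFields(DataInput in) throws IOException {
    rowBlock = in.readInt();
    lengthBytes = in.readLong();
  }
}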
Hadoop Serialization and Mutation

[Diagram: the Hadoop pipeline from the earlier flow slides (Input → Map → File System → Shuffle → File System → Reduce → Output over HDFS), with emphasis on where key/value objects are serialized and deserialized.]

Encouraging Mutation: WordCount

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  // One IntWritable and one Text are allocated once and then reused (mutated)
  // for every output record, rather than allocating fresh objects per token.
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

Aliasing and Mutation

[Diagram: the cached M3R/Hadoop pipeline (with Partitioner); in memory, keys/values are aliased rather than copied, so mutation in one stage is visible to others.]

ImmutableOutputs

[Diagram: the same pipeline, where the Mapper and Reducer implement ImmutableOutput, i.e. they promise not to mutate their outputs, so the engine does not need to clone them.]
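
A hedged sketch of what opting in might look like (ImmutableOutput is M3R's marker interface per the slide, but its package and exact form are assumed here): in contrast to the mutation-friendly WordCount mapper earlier, this one emits objects it never touches again, so the engine is free to alias them without cloning.

// Hypothetical sketch; ImmutableOutput is assumed to be an empty marker interface.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImmutableWordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable>
    implements ImmutableOutput {                 // promise: outputs are never mutated

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tok = new StringTokenizer(value.toString());
    while (tok.hasMoreTokens()) {
      // Emit a fresh Text per token instead of reusing one mutable instance,
      // so the emitted objects really are safe to alias downstream.
      context.write(new Text(tok.nextToken()), ONE);
    }
  }
}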
M3R foundation

• At its heart, a core X10 M3R engine allows jobs to "pin" data in distributed memory using standard X10 features.
• Over this, we built an interop layer that consumes Hadoop jobs and runs them against the core engine.
  – The cache and the partition stability guarantee mediate between these worlds
• Allow a Hadoop programmer to pin data in memory
• Writing directly to the X10 M3R interface offers the opportunity for increased performance.
  – X10 code for MatVecMult performs 10x better (managed backend, sockets)
  – Native code (on sockets) 30x better

Mappers/Shuffle/Reducers

[Diagram repeated: Mapper1-Mapper6 → Shuffle → Reducer1-Reducer6.]

Connecting Mappers and Reducers

[Diagram: Mapper1-Mapper6 → Shuffle → Reducer1-Reducer6, focusing on how mapper outputs are routed to reducers.]

Partitioner: Connecting Mappers and Reducers

[Diagram: Mapper1-Mapper6 → Shuffle → Partitioner → Reducer1-Reducer6; the Partitioner decides which reducer receives each key/value pair.]

int partitionNumber = getPartition(key, value);
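
For reference, a minimal custom partitioner against the Hadoop mapred API (the row-block scheme is just an illustrative choice, and note that the real Hadoop method also receives the number of reduce tasks as a third argument):

// Minimal illustrative partitioner (old mapred API). Under M3R, choosing the
// partition also fixes the place where the corresponding reducer runs.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class RowBlockPartitioner implements Partitioner<IntWritable, Text> {

  public void configure(JobConf job) { }   // no configuration needed

  public int getPartition(IntWritable key, Text value, int numPartitions) {
    // Keys are assumed to be row-block indices: same key => same partition
    // => (with partition stability) same place, across jobs.
    return Math.abs(key.get() % numPartitions);
  }
}

It would be installed on the job with conf.setPartitionerClass(RowBlockPartitioner.class).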