MapReduce
Google and MapReduce





Google searches billions of web pages very, very quickly
How?
It uses a technique called “MapReduce” to distribute the work across a large number of computers, then combine the results
This has made MapReduce a very popular approach
Hadoop is an open source implementation of MapReduce

Unless you work for Google, you will probably use Hadoop
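
The idea of distributing the work and then combining the results can be sketched on a single machine. This is only an illustration in Scala, not how Google or Hadoop implement it: the names WordCountSketch, countWords, merge, and wordCount are hypothetical, the "pages" are short strings, and the chunks are processed one after another rather than on separate computers.

```scala
// Single-machine simulation of the MapReduce idea: split the input,
// process each chunk independently, then combine the partial results.
object WordCountSketch {
  // "Map" step: count the words in one chunk without looking at any other chunk.
  def countWords(chunk: Seq[String]): Map[String, Int] =
    chunk
      .flatMap(_.toLowerCase.split("\\s+").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // "Reduce" step: merge two partial count tables into one.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  def wordCount(pages: Seq[String], chunkSize: Int): Map[String, Int] =
    pages
      .grouped(chunkSize)   // one big split into independent chunks
      .map(countWords)      // on a cluster, each chunk could go to a different computer
      .foldLeft(Map.empty[String, Int])(merge)

  def main(args: Array[String]): Unit = {
    val pages = Seq("the cat sat", "the dog sat", "a cat ran")
    println(wordCount(pages, chunkSize = 2))
  }
}
```

Because countWords never needs data from another chunk, the chunks could in principle be counted in any order or in parallel; only merge has to see more than one partial result.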
How it works


List(a, b, c, …).map(x => f(x)) gives List(f(a), f(b), f(c), …)
List(a, b, c, …).reduce((x, y) => x ⊕ y) gives a ⊕ b ⊕ c ⊕ …
where ⊕ is some binary operator
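
A concrete instance of the two operations may help. In this small Scala sketch, f is a hypothetical stand-in that squares its argument, and the binary operator is assumed to be +; neither choice comes from the slides.

```scala
object MapReduceBasics {
  // Hypothetical stand-in for f: square a number.
  def f(x: Int): Int = x * x

  val inputs: List[Int] = List(1, 2, 3, 4)

  // map applies f to every element independently.
  val mapped: List[Int] = inputs.map(x => f(x))      // List(1, 4, 9, 16)

  // reduce combines the elements pairwise with a binary operator, here +.
  val total: Int = mapped.reduce((x, y) => x + y)    // 1 + 4 + 9 + 16 = 30

  def main(args: Array[String]): Unit = {
    println(mapped)
    println(total)
  }
}
```

If the combining work is to be spread over several machines, the operator should be associative, as + is, so that partial results can be grouped in any way without changing the answer.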
Another view

http://www.cnblogs.com/sharpxiajun/p/3151395.html
(in Chinese)
ForkJoin


How does ForkJoin differ from MapReduce?
Answers from Stack Overflow:




ForkJoin recursively partitions a task into several subtasks, on a single machine. It takes advantage of multiple cores.
MapReduce only does one big split, with no communication between the parts until the reduce step. It is massively scalable.
Java fork/join starts quickly and scales well for small inputs (<5MB), but it cannot process larger inputs due to the size restrictions of shared-memory, single-node architectures.
MapReduce takes tens of seconds to start up, but scales well for much larger inputs (>100MB) on a compute cluster.
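
The recursive partitioning that ForkJoin performs can be sketched with Java's java.util.concurrent classes, called here from Scala. The names ForkJoinSketch, SumTask, and parallelSum are hypothetical, and the threshold of 1000 elements is an arbitrary assumption, not a recommended setting.

```scala
import java.util.concurrent.{ForkJoinPool, RecursiveTask}

object ForkJoinSketch {
  // Ranges of at most this many elements are summed directly (assumed value).
  val Threshold = 1000

  // Sums the elements data(from) up to, but not including, data(to)
  // by recursively splitting the range in half.
  class SumTask(data: Array[Long], from: Int, to: Int) extends RecursiveTask[Long] {
    override def compute(): Long =
      if (to - from <= Threshold) {
        var sum = 0L
        var i = from
        while (i < to) { sum += data(i); i += 1 }
        sum
      } else {
        val mid = from + (to - from) / 2
        val left = new SumTask(data, from, mid)
        val right = new SumTask(data, mid, to)
        left.fork()                      // schedule the left half for another worker thread
        val rightSum = right.compute()   // compute the right half in the current thread
        left.join() + rightSum           // wait for the left half, then combine
      }
  }

  def parallelSum(data: Array[Long]): Long =
    ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length))

  def main(args: Array[String]): Unit = {
    val data = Array.tabulate(100000)(i => i.toLong)
    println(parallelSum(data))
  }
}
```

All subtasks read the same in-memory array, which is why this style is tied to a single shared-memory machine, whereas a MapReduce job hands each part of its one big split to a separate node.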
The End