cloudera training > MapReduce and HDFS > map-reduce overview

ref: http://www.cloudera.com/sites/default/files/2-MapReduceAndHDFS.pdf

– borrows from functional programming: map, reduce
– provides an interface for map/reduce; we must implement the interface
– map
— the mapper can emit an arbitrary (key, value) pair, not necessarily the input key/value
— the same map task may run simultaneously on multiple machines (speculative execution); the first copy to complete is used
— each map runs in its own JVM
— maps run in parallel
— input is usually 64MB – 128MB chunks (larger chunks mean more sequential streaming, fewer seeks); see the mapper sketch after this list
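
a minimal sketch of the map side, word-count style, assuming the newer org.apache.hadoop.mapreduce Java API (the slides may use the older mapred interface); the class and field names (TokenMapper, ONE, word) are illustrative, not from the slides:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // input pair is (byte offset, line of text); the emitted pair (word, 1)
    // has different types than the input pair
    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit an arbitrary (key, value) pair
                }
            }
        }
    }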

– reduce
— the number of reduces that run corresponds to the number of output files
— ideally, we want 1 reduce, i.e. a single output file
— reduces run in parallel; see the reducer sketch after this list
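
a matching reducer sketch, again assuming the org.apache.hadoop.mapreduce API; after the shuffle, each call receives one key together with all of its values (SumReducer and total are illustrative names):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // sums all counts emitted for one word and writes (word, total) to the output
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }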

– flow
— data store of k/v pairs > map > barrier (shuffle phase) > reduce > result
– chained map-reduce jobs are common
– all values are processed independently
– bottleneck: no reduce can run until all maps are finished
– combiner
— runs immediately after mapper on map node
— can use the reducer function as the combiner if the reduce operation is commutative and associative; see the driver sketch after this list
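
a driver sketch tying the pieces together, assuming a Hadoop 2.x-style Job API; WordCountDriver and the argument handling are illustrative. because summing is commutative and associative, the same SumReducer class can be registered as the combiner:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenMapper.class);
            // the reducer doubles as the combiner, pre-aggregating on each map node
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            // one reduce task -> one output file
            job.setNumReduceTasks(1);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // blocks until all maps and then all reduces finish; a chained
            // map-reduce pipeline would submit a second Job here that reads
            // this job's output path as its input
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }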

– conclusions
— mapreduce is a useful abstraction
— simplifies large-scale computation
— lets the programmer focus on the problem while the library handles the details of distribution