Software engineering notes

cloudera training > MapReduce and HDFS > map-reduce overview

leave a comment »


– borrows from functional programming: map, reduce
– provides an interface for map/reduce; we must implement the interface
– map
— the mapper can emit an arbitrary pair, not necessarily the input key/val
— the mapper runs simultaneously on multiple machines; the first to complete is used
— each map runs in its own jvm
— each run in parallel
— input is usualy 64MB – 128MB chunks (results in more streaming)

– reduce
— the number of reduces that run corresponds to the number of output files
— ideally, we want 1 reduce
— run in paralllel

– flow
— data store of k/v pairs > map > barrier (shuffle phase) > reduce > result
– chained map-reduce jobs are common
– all values are processed independently
– bottleneck: now reduce can run until all maps are finished
– combiner
— runs immediately after mapper on map node
— can use reducer function if reducer is commutative and associative

– conclusions
— mapreduce is a useful abstraction
— simplifies large scale comp
— lets the programmer focus on the problem and the library handle the details of distribution

Written by Erik

June 11, 2009 at 9:27 am

Posted in notes

Tagged with , ,

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: