notes from cloudera basic training > hadoop mapreduce deep dive

//we moved quickly through this, so the notes are sparse
– job
— a full program

– task
— by default, hadoop creates the same amount of tasks as there are input blocks

— task attempts
— tasks are attempted at least once
— multiple attempts in parellel are performed w/ speculative execution turned on

– tasktracker
— forks jvm process for each task

– job distribution
— mapreduce programs = jar + xml config
— running a job puts jar and xml in hdfs

– data distribution
— data locality decreases when multiple tasks are running

– mapreduce flow
— client creates joconf
— identify map and reducer classes
— specify inputs/outputs
— set optional settings
— job launches jobclient
— runjob blcks until the job completes
— submitjob is non-blocking
— …
— tasttracker
— perioducally query jobtracker for work
— …
— write for cache coherency (re-use objects in loops(?))
— reusing memory locations => 2x speed-up
— all k/v pairs given by hadoop use this model
//is avro comparable to thrift?

– getting data to mapper
— data sets are specified
— input sets contain at least 1 record and are composed of full blocks

– file input format
— most people use SequenceFileInputFormat
— usually we store all our data in hdfs and then ignore what we don’t need, rather than spending time formatting the data when it’s input

— …

– shuffling
— what happens btwn map and reduce

– write the output
— OutputFormat is analagous to InputFormat