//we moved quickly through this, so the notes are sparse
– job
— a full program
– task
— by default, hadoop creates the same number of map tasks as there are input blocks
— task attempts
— tasks are attempted at least once
— multiple attempts run in parallel when speculative execution is turned on (snippet below)
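//aside, not from lecture: a quick sketch of toggling speculative execution per job, using the classic pre-YARN config keys on a JobConf (org.apache.hadoop.mapred); "conf" is assumed to be the job's JobConf
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);     // allow duplicate map attempts
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false); // but not duplicate reduce attempts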
– tasktracker
— forks jvm process for each task
– job distribution
— mapreduce programs = jar + xml config
— running a job puts jar and xml in hdfs
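//sketch, not from lecture: how a driver ties the job to its jar so hadoop can distribute it (WordCountDriver is a made-up class; a fuller driver sketch appears in the mapreduce flow section)
    JobConf conf = new JobConf(WordCountDriver.class);   // records which jar contains the job's classes
    // equivalent: conf.setJarByClass(WordCountDriver.class);
    // on submit, the client copies this jar plus the serialized config (the xml) into hdfs
    // so the tasktrackers can fetch them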
– data distribution
— data locality decreases as more tasks run concurrently, since not every task can be scheduled on a node that holds its block
– mapreduce flow
— client creates a JobConf
— identify the mapper and reducer classes
— specify inputs/outputs
— set optional settings
— the job is launched via JobClient
— runJob blocks until the job completes
— submitJob is non-blocking (see the driver sketch at the end of this section)
— …
— tasktracker
— periodically queries the jobtracker for work
— …
— write for cache coherency (i.e. re-use objects inside loops rather than allocating new ones each iteration)
— reusing memory locations => 2x speed-up
— all k/v pairs that hadoop hands to your code use this model (reducer sketch below)
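//rough driver sketch for the flow above, using the old org.apache.hadoop.mapred API; WordCountMapper/WordCountReducer and the input/output paths are made up for illustration
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);   // also tells hadoop which jar to ship
        conf.setJobName("wordcount");

        conf.setMapperClass(WordCountMapper.class);           // hypothetical Mapper implementation
        conf.setReducerClass(WordCountReducer.class);         // hypothetical Reducer (sketched just below)

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);   // blocks until the job completes
        // non-blocking alternative: new JobClient(conf).submitJob(conf) returns a RunningJob handle immediately
      }
    }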
//is avro comparable to thrift?
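//sketch of the object re-use point above (not lecture code): hadoop recycles the Writable objects it passes to reduce(), so they must be read or copied before the next iteration
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      private final IntWritable result = new IntWritable();   // our own re-used output object, same model

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          // values.next() may hand back the same recycled IntWritable every time,
          // so read (or copy) it immediately rather than storing the reference
          sum += values.next().get();
        }
        result.set(sum);
        output.collect(key, result);
      }
    }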
– getting data to mapper
— data sets are specified
— input splits contain at least 1 record and are composed of full blocks
– file input format
— most people use SequenceFileInputFormat (wired up in the sketch below)
— usually we store all our data in hdfs and ignore what we don’t need, rather than spending time reformatting the data as it comes in
— …
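//input-side sketch (continuing the driver above; the path is made up): swapping in SequenceFileInputFormat for the default TextInputFormat
    conf.setInputFormat(SequenceFileInputFormat.class);             // org.apache.hadoop.mapred.SequenceFileInputFormat
    FileInputFormat.setInputPaths(conf, new Path("/data/events"));  // hypothetical path; directories are fine
    // by default each block-sized input split then becomes one map task (see "task" above)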
– shuffling
— what happens between map and reduce
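//the notes stop here. roughly: the shuffle partitions map output by key, sorts each partition, and copies it to the reducer responsible for it. below is a sketch (an addition, not from lecture) of the hook that decides that routing, old mapred API:
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // keys with the same first character end up at the same reducer
    public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {
      public void configure(JobConf job) { }   // no per-job setup needed

      public int getPartition(Text key, IntWritable value, int numPartitions) {
        // this return value decides which reduce task receives the pair during the shuffle
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
      }
    }
    // wired in with conf.setPartitionerClass(FirstCharPartitioner.class)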
– write the output
— OutputFormat is analogous to InputFormat
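//output-side sketch (continuing the driver above; the path is made up): TextOutputFormat writes one "key<TAB>value" line per record
    conf.setOutputFormat(TextOutputFormat.class);   // org.apache.hadoop.mapred.TextOutputFormat is the default anyway
    FileOutputFormat.setOutputPath(conf, new Path("/results/wordcount"));   // hypothetical path; must not already exist
    // a SequenceFileOutputFormat here would keep the binary form, mirroring SequenceFileInputFormat on the input side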