notes from cloudera basic training > hadoop mapreduce deep dive

//we moved quickly through this, so the notes are sparse
– job
— a full program: an execution of mappers and reducers across a data set

– task
— by default, hadoop creates as many map tasks as there are input blocks

— task attempts
— tasks are attempted at least once
— multiple attempts run in parallel when speculative execution is turned on (see the sketch below)
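
a minimal sketch of toggling speculative execution per job, assuming the classic (pre-0.21) jobconf property names:

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculativeToggle {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // run duplicate attempts of slow tasks; the first to finish wins
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
      }
    }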

– tasktracker
— forks a jvm process for each task

– job distribution
— mapreduce programs = jar + xml config
— running a job puts jar and xml in hdfs

– data distribution
— data locality decreases as more tasks run concurrently, since fewer tasks can be scheduled on the nodes holding their input blocks

– mapreduce flow
— client creates jobconf (see the wordcount sketch below)
— identify mapper and reducer classes
— specify inputs/outputs
— set optional settings
— jobclient launches the job
— runjob blocks until the job completes
— submitjob is non-blocking
— …
— tasktracker
— periodically queries the jobtracker for work
— …
— write for cache coherency (re-use objects in loops rather than allocating new ones)
— reusing memory locations => 2x speed-up
— all k/v pairs given by hadoop use this model
//is avro comparable to thrift?
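
a minimal wordcount sketch of the flow above, using the classic org.apache.hadoop.mapred api (class names and args are illustrative):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        // re-used across calls -- the "write for cache coherency" point above
        private final Text word = new Text();
        private static final IntWritable ONE = new IntWritable(1);

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          for (String tok : value.toString().split("\\s+")) {
            if (tok.length() == 0) continue;
            word.set(tok);
            out.collect(word, ONE);
          }
        }
      }

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) sum += values.next().get();
          out.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class); // client creates jobconf
        conf.setJobName("wordcount");
        conf.setMapperClass(Map.class);              // identify mapper/reducer classes
        conf.setReducerClass(Reduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // inputs
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // outputs
        JobClient.runJob(conf); // blocks until completion; submitJob is the non-blocking variant
      }
    }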

– getting data to mapper
— data sets are specified by the input format
— input splits contain at least 1 record and are composed of full blocks

– file input format
— most people use SequenceFileInputFormat
— usually we store all our data in hdfs and ignore what we don't need, rather than spending time reformatting the data on input (see the sketch below)

— …
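
a sketch of choosing the input format on the jobconf (classic mapred api):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class InputChoice {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // default is TextInputFormat (byte offset -> line of text);
        // SequenceFileInputFormat reads hadoop's binary key/value files
        conf.setInputFormat(SequenceFileInputFormat.class);
      }
    }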

– shuffling
— what happens between map and reduce: map outputs are partitioned by key, sorted, and copied to the reducers (a partitioner sketch follows)
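
during the shuffle, each map output key is routed to a reducer by a partitioner; a sketch mirroring the default hash partitioning (classic mapred api, class name is mine):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class HashLikePartitioner implements Partitioner<Text, IntWritable> {
      public void configure(JobConf job) {} // no per-job setup needed

      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // mask the sign bit so the modulo result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }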

– write the output
— OutputFormat is analogous to InputFormat
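
a sketch of the mirror-image call for output (classic api; the output path is an assumption):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class OutputChoice {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // default is TextOutputFormat (key \t value lines)
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(conf, new Path("/user/training/out"));
      }
    }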

cloudera basic training > the hadoop ecosystem

ref: http://www.cloudera.com/hadoop-training-ecosystem-tour
– google origins
— mapreduce -> hadoop mapreduce
— gfs -> hdfs
— sawzall -> hive, pig (log data warehouses)
— bigtable -> hbase
— chubby -> zookeeper (distributed lock service)
– pig
— “tables” are directories in hadoop
– hive
— uses subset of sql instead of pig latin
— not good for serving realtime queries
— a jdbc interface for hive exists (see the sketch below)
— pig and hive exercises on cloudera vm
— features for analyzing very large data sets
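
a hedged sketch of querying hive over jdbc; the driver class and url follow the hive-era jdbc docs, while host, port, and the "logs" table are assumptions for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbc {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con =
            DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT COUNT(1) FROM logs");
        while (rs.next()) {
          System.out.println(rs.getLong(1)); // row count, computed as a mapreduce job
        }
        con.close();
      }
    }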

– hbase
— column-store database based on bigtable
— holds extremely large datasets
— still very young relative to hadoop
— uses hdfs
— fast single-element access
— only supports single-row transactions
— transactions block reads
— recent updates are held in memory and written as logs to hdfs; limited because hdfs doesn't support append (yet)
— each row is input to mapreduce

– zookeeper
— uses a paxos-like consensus algorithm (zab)
— a distributed consensus engine
— zookeeper may be the method for creating a high-availability namenode

– fuse-dfs
— lets you mount hdfs via linux fuse
— not an alternative file server
— good for easy access to cluster

– hypertable
— competitor to hbase
— used by baidu (chinese search engine)

– kosmosfs
– sqoop
– chukwa
— hadoop log aggregation

– scribe
— general log aggregation

– mahout
— machine learning library

– cassandra
— column store database on a p2p backend

– dumbo
— python library for writing hadoop streaming jobs

cloudera training > MapReduce and HDFS > map-reduce overview

ref: http://www.cloudera.com/sites/default/files/2-MapReduceAndHDFS.pdf

– borrows from functional programming: map, reduce
– provides an interface for map/reduce; we must implement the interface
– map
— the mapper can emit an arbitrary pair, not necessarily the input key/val
— with speculative execution, the same map runs simultaneously on multiple machines; the first to complete is used
— each map runs in its own jvm
— all maps run in parallel
— input is usually 64MB – 128MB chunks (results in more streaming)

– reduce
— the number of reduces that run corresponds to the number of output files
— ideally, we want 1 reduce
— reduces run in parallel (see the sketch below)
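
a sketch of setting the reduce count on the jobconf (classic api):

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceCount {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // each reduce task writes one output file: part-00000, part-00001, ...
        conf.setNumReduceTasks(4);
      }
    }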

– flow
— data store of k/v pairs > map > barrier (shuffle phase) > reduce > result
– chained map-reduce jobs are common
– all values are processed independently
– bottleneck: no reduce can run until all maps are finished
– combiner
— runs immediately after mapper on map node
— can use the reducer function as the combiner if the reduce operation is commutative and associative (see the sketch below)
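
a sketch of enabling a combiner by reusing the reducer from the wordcount example earlier (summing counts qualifies):

    import org.apache.hadoop.mapred.JobConf;

    public class CombinerSetup {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // safe because integer addition is commutative and associative
        conf.setCombinerClass(WordCount.Reduce.class);
      }
    }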

– conclusions
— mapreduce is a useful abstraction
— simplifies large-scale computation
— lets the programmer focus on the problem while the library handles the details of distribution