ref: http://www.cloudera.com/hadoop-training-ecosystem-tour
– google origins
— mapreduce -> hadoop mapreduce
— gfs -> hdfs
— sawzall -> hive, pig (log data warehouses)
— bigtable -> hbase
— chubby -> zookeeper (distributed lock/coordination service)
– pig
— “tables” are directories in hadoop
– hive
— uses subset of sql instead of pig latin
— not good for serving realtime queries
— jdbc interface for hive exists (see the query sketch after this list)
— pig and hive exercises on cloudera vm
— features for analyzing very large data sets
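a minimal sketch of querying hive programmatically. the notes only say a jdbc interface exists; this uses the third-party pyhive client instead (an assumption), and the host, port, and "logs" table are hypothetical:

    # sketch: querying hive from python via the third-party pyhive client
    # (server host/port and the "logs" table are hypothetical)
    from pyhive import hive

    conn = hive.connect(host="hive-server.example.com", port=10000)
    cursor = conn.cursor()

    # hiveql is a subset of sql; queries compile to mapreduce jobs,
    # which is why this is a batch path, not a realtime one
    cursor.execute("SELECT status, COUNT(*) FROM logs GROUP BY status")
    for status, count in cursor.fetchall():
        print(status, count)

    cursor.close()
    conn.close()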
– hbase
— column-store database based on bigtable
— holds extremely large datasets
— still very young relative to hadoop
— uses hdfs
— fast single-element access (see the sketch after this list)
— only supports single-row transactions
— transactions block reads
— recent writes are held in memory; updates are written as logs to hdfs. limited because hdfs doesn't support append (yet)
— each row is input to mapreduce
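a sketch of the single-row access pattern. hbase's native client is java; this uses the third-party happybase client over hbase's thrift gateway (an assumption), and the host, table, and column names are hypothetical:

    # sketch: single-row put/get against hbase via the third-party
    # happybase client (thrift gateway); all names are hypothetical
    import happybase

    conn = happybase.Connection("hbase-thrift.example.com")
    table = conn.table("webtable")

    # a put touches exactly one row, matching the single-row
    # transaction guarantee; the update lands in memory plus a
    # log written to hdfs
    table.put(b"com.example/index.html",
              {b"contents:html": b"<html>...</html>"})

    # single-element access is fast: a get addressed by row key
    row = table.row(b"com.example/index.html")
    print(row[b"contents:html"])

    conn.close()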
– zookeeper
— uses zab (zookeeper atomic broadcast), a paxos-like consensus protocol
— a distributed consensus engine
— zookeeper may be the method for creating a high-availability namenode (see the sketch after this list)
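a sketch of what coordination on top of zookeeper looks like, using the third-party kazoo client (an assumption; zookeeper ships a java/c api). hosts and paths are hypothetical; the pattern is the standard ephemeral-sequential leader election:

    # sketch: leader election on zookeeper znodes via the third-party
    # kazoo client; hosts and paths are hypothetical
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.com:2181")
    zk.start()

    # every write to the znode tree is agreed on by a quorum via the
    # consensus protocol, so all clients see the same ordering
    zk.ensure_path("/cluster/election")
    me = zk.create("/cluster/election/candidate-", b"node-a",
                   ephemeral=True, sequence=True)

    # the candidate holding the lowest sequence number is the leader;
    # its ephemeral node vanishes if it dies, triggering re-election.
    # this is the kind of building block an ha namenode would need
    children = sorted(zk.get_children("/cluster/election"))
    print("leader is", children[0], "- me:", me)

    zk.stop()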
– fuse-dfs
— lets you mount hdfs via linux fuse
— not an alternative file server
— good for easy access to the cluster (see the sketch below)
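once mounted, the cluster behaves like an ordinary posix filesystem, so no hadoop api is needed. a sketch assuming hdfs is mounted at /mnt/hdfs (the mount point and paths are hypothetical):

    # sketch: reading hdfs through a fuse-dfs mount with plain file i/o;
    # /mnt/hdfs and the paths under it are hypothetical
    import os

    mount = "/mnt/hdfs/user/training"
    for name in os.listdir(mount):
        print(name)

    # any tool that speaks posix file apis can read cluster data this way
    with open(os.path.join(mount, "sample.txt")) as f:
        print(f.read())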
– hypertable
— competitor to hbase
— used by baidu (chinese search engine)
– kosmosfs
— distributed filesystem written in c++; an open-source alternative to hdfs
– sqoop
— tool for importing data from relational databases into hadoop
– chukwa
— hadoop log aggregation
– scribe
— general-purpose log aggregation (from facebook)
– mahout
— machine learning library built on hadoop
– cassandra
— column store database on a p2p backend
– dumbo
— python library for writing hadoop streaming jobs (see the sketch below)
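a minimal dumbo sketch, the usual wordcount. the mapper and reducer are plain python generators that dumbo runs over hadoop streaming (input/output paths are supplied on the command line):

    # sketch: wordcount in dumbo; mapper/reducer are python generators
    def mapper(key, value):
        # key is the byte offset, value is one line of text
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer)

launched with something like "dumbo start wordcount.py -input in.txt -output out" locally, plus a -hadoop flag to submit to the cluster (flags as i remember them from dumbo's readme).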