cloudera basic training > the hadoop ecosystem

ref: http://www.cloudera.com/hadoop-training-ecosystem-tour
– google origins
— mapreduce -> hadoop mapreduce
— gfs -> hdfs
— sawzall -> hive, pig (log data warehouses)
— bigtable -> hbase
— chubby -> zookeeper (distributed lock service)
– pig
— “tables” are directories in hadoop
– hive
— uses subset of sql instead of pig latin
— not good for serving realtime queries
— jdbc interface for hive exists
— pig and hive exercises on cloudera vm
— features for analyzing very large data sets
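
The pig and hive notes above can be pictured with a small sketch. It assumes a "table" is just a directory of tab-separated part files, and it computes in plain Python the kind of aggregate that hive would express in its sql subset. The table name `hits`, the column names, and the helper functions are made up for illustration; this is not pig or hive code.

```python
import os
import tempfile
from collections import Counter


def write_table(directory, parts):
    """Write each list of rows as one tab-separated part file in `directory`."""
    os.makedirs(directory, exist_ok=True)
    for i, rows in enumerate(parts):
        path = os.path.join(directory, "part-%05d" % i)
        with open(path, "w") as f:
            for row in rows:
                f.write("\t".join(row) + "\n")


def read_table(directory):
    """Yield rows from every part file: the directory as a whole is the table."""
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as f:
            for line in f:
                yield line.rstrip("\n").split("\t")


def hits_per_page(directory):
    # Roughly what a hive query such as
    #   SELECT page, COUNT(*) FROM hits GROUP BY page;
    # would compute (table and column names are hypothetical).
    return Counter(page for page, _user in read_table(directory))


def demo():
    """Build a two-file table in a temp directory and aggregate over it."""
    with tempfile.TemporaryDirectory() as tmp:
        table = os.path.join(tmp, "hits")
        write_table(table, [
            [("/home", "alice"), ("/about", "bob")],
            [("/home", "carol")],
        ])
        return dict(hits_per_page(table))
```

Calling `demo()` reads both part files as one table, which is the point of the "tables are directories" note: adding a file to the directory adds rows to the table.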

– hbase
— column-store database based on bigtable
— holds extremely large datasets
— still very young relative to hadoop
— uses hdfs
— fast single-element access
— only supports single-row transactions
— transactions block reads
— recent writes are kept in memory and logged to hdfs; the log's durability is limited because hadoop doesn’t have append (yet)
— each row is input to mapreduce
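
A toy model may help make the hbase notes concrete: values addressed by row key and column name, updates that are atomic only within one row, reads that wait while a row is being updated, and a scan that hands each row to a map step. The class `ToyTable` and the function `count_columns` are invented for this sketch and are not the hbase client api; real hbase also persists data to hdfs, which this in-memory version ignores.

```python
import threading
from collections import defaultdict


class ToyTable:
    """In-memory toy of a column-oriented table; NOT the HBase client API."""

    def __init__(self):
        self._rows = defaultdict(dict)             # row key -> {"family:qualifier": value}
        self._locks = defaultdict(threading.Lock)  # one lock per row
        self._guard = threading.Lock()             # protects creation of row locks

    def _row_lock(self, row_key):
        with self._guard:
            return self._locks[row_key]

    def put(self, row_key, columns):
        """Apply all column updates to one row atomically."""
        with self._row_lock(row_key):
            self._rows[row_key].update(columns)

    def get(self, row_key, column):
        """Single-element access by row key and column name."""
        # Taking the row lock means a read waits for an in-progress update.
        with self._row_lock(row_key):
            return self._rows.get(row_key, {}).get(column)

    def scan(self):
        """Yield (row_key, columns) pairs in key order, one per row."""
        for key in sorted(self._rows):
            yield key, dict(self._rows[key])


def count_columns(table):
    """A map step applied to each row: emit (row_key, number of columns)."""
    return [(key, len(cols)) for key, cols in table.scan()]
```

There is deliberately no way here to update two rows in one atomic step, mirroring the single-row transaction limit noted above.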

– zookeeper
— uses zab, a paxos-like atomic broadcast protocol
— a distributed consensus engine
— zookeeper may be the method for creating a high-availability namenode
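
The arithmetic behind a majority-based consensus engine can be sketched in a few lines. This only shows the quorum rule (a change is agreed once a strict majority of servers acknowledges it); leader election, ordering, and recovery are left out, and the function names are made up for the example.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


def is_committed(acks, ensemble_size):
    """A proposal counts as agreed once a majority has acknowledged it."""
    return acks >= quorum_size(ensemble_size)
```

One consequence: an ensemble of 4 tolerates no more failures than an ensemble of 3, which is why odd sizes are the usual choice.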

– fuse-dfs
— lets you mount hdfs via linux fuse
— not an alternative file server
— good for easy access to cluster

– hypertable
— competitor to hbase
— used by baidu (chinese search engine)

– kosmosfs
– sqoop
– chukwa
— hadoop log aggregation

– scribe
— general log aggregation

– mahout
— machine learning library

– cassandra
— column store database on a p2p backend

– dumbo
— python library for streaming
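
For context, hadoop streaming runs any program that reads lines on stdin and writes tab-separated key/value lines on stdout as a mapper or reducer. The sketch below is a plain-python word count in that style, written against iterables of lines so it can be tried without a cluster. It does not use dumbo itself, and the function names are illustrative.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process for testing."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

On a cluster the sort between the two steps is done by the framework; `run_locally` stands in for it with `sorted`. To use the functions as real streaming scripts, each would be wrapped in a small script that feeds it `sys.stdin` and prints every line it yields.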
