cloudera basic training > the hadoop ecosystem

ref: http://www.cloudera.com/hadoop-training-ecosystem-tour
– google origins
— mapreduce -> hadoop mapreduce (see the word-count sketch after this list)
— gfs -> hdfs
— sawzall -> hive, pig (log data warehouses)
— bigtable -> hbase
— chubby -> zookeeper (distributed lock/coordination service)
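
a minimal sketch of the hadoop mapreduce programming model (the standard word-count example, using the org.apache.hadoop.mapreduce api; the job-submission driver is omitted):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map phase: emit (word, 1) for every token in the input line
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // reduce phase: sum the counts for each word
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }
}
```
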
– pig
— “tables” are directories in hadoop
– hive
— uses a subset of sql (hiveql) instead of pig latin
— not good for serving realtime queries
— jdbc interface for hive exists (see the sketch after this list)
— pig and hive exercises on cloudera vm
— features for analyzing very large data sets
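
a hedged sketch of the jdbc route mentioned above; driver class and url differ across hive versions (this assumes a hiveserver2-style setup on localhost:10000, and the page_views table is made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // driver class and url are version-dependent assumptions
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // hive compiles this into mapreduce jobs behind the scenes,
        // so expect batch latency, not realtime query serving
        ResultSet rs = stmt.executeQuery(
                "SELECT referrer, COUNT(*) FROM page_views GROUP BY referrer");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}
```
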

– hbase
— column-store database based on bigtable
— holds extremely large datasets
— still very young relative to hadoop
— uses hdfs
— fast single-element access (see the single-row sketch after this list)
— only supports single-row transactions
— transactions block reads
— recent writes are buffered in memory and logged to hdfs for durability. limited because hdfs doesn’t support append (yet)
— each row is input to mapreduce
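
a sketch of that single-row access; the api below is the modern hbase client (Connection/Table), which differs from the HTable api that was current when these notes were taken, and the "users" table and "info" column family are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // write one cell: row "user42", column info:email
            // (atomic within this single row, per the note above)
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("user42@example.com"));
            table.put(put);

            // fast single-row read by key
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] email = result.getValue(Bytes.toBytes("info"),
                                           Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```
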

– zookeeper
— uses a paxos-like consensus protocol (zab)
— a distributed consensus engine
— zookeeper may be the method for creating a high-availability namenode (see the ephemeral-node sketch below)
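
a minimal sketch of how zookeeper-style coordination could back such a failover scheme: each candidate tries to create the same ephemeral znode, whoever succeeds acts as the active node, and the znode disappears if that process dies. the znode path and server address are made up:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class ActiveNodeElection {
    public static void main(String[] args) throws Exception {
        // wait until the session to the zookeeper ensemble is established
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        try {
            // ephemeral: the znode is deleted automatically when our
            // session ends, so a standby can take over after a crash
            zk.create("/active-namenode", "host-a".getBytes(),
                      Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("this process is now the active node");
        } catch (org.apache.zookeeper.KeeperException.NodeExistsException e) {
            System.out.println("another process is already active");
        }
        // keep the session alive while acting as the active node ...
    }
}
```
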

– fuse-dfs
— lets you mount hdfs via linux fuse
— not an alternative file server
— good for easy access to the cluster (see the sketch below)
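
the point of fuse-dfs is that ordinary file apis work once the mount exists; assuming a hypothetical mount point of /mnt/hdfs, plain java.io is enough (no hadoop client libraries needed):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadFromMountedHdfs {
    public static void main(String[] args) throws IOException {
        // an hdfs file, read through the fuse mount like any local file
        // (path is hypothetical)
        try (BufferedReader in = new BufferedReader(
                new FileReader("/mnt/hdfs/logs/part-00000"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```
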

– hypertable
— competitor to hbase
— used by baidu (chinese search engine)

– kosmosfs
— c++ distributed filesystem, similar in design to hdfs
– sqoop
— imports data from relational databases into hadoop
– chukwa
— hadoop log aggregation

– scribe
— general log aggregation

– mahout
— machine learning library

– cassandra
— column store database on a p2p backend

– dumbo
— python library for writing hadoop streaming jobs

barcamp san diego 5: “hbase, cassandra, bigtable, simpledb discussion”

– amazon dynamo
– cassandra
— conflict resolution: latest timestamp wins (see the sketch below)
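
a minimal sketch of timestamp-based (last-write-wins) reconciliation, assuming each replica stores a write timestamp next to the value:

```java
// a replica's copy of a value, tagged with the time it was written
class Versioned {
    final byte[] value;
    final long timestampMillis;

    Versioned(byte[] value, long timestampMillis) {
        this.value = value;
        this.timestampMillis = timestampMillis;
    }
}

class LastWriteWins {
    // reconcile two replicas of the same key: the newer write wins,
    // and the older write is silently discarded
    static Versioned reconcile(Versioned a, Versioned b) {
        return a.timestampMillis >= b.timestampMillis ? a : b;
    }
}
```
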

– managing distributed records
— use checksums to verify data health (see the sketch below)
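
a small sketch of the idea using java's built-in crc32: store a checksum with each record when writing, recompute it on read, and treat a mismatch as corruption:

```java
import java.util.zip.CRC32;

public class ChecksumCheck {
    // compute a crc32 over a record's bytes
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] record = "user:42,name:alice".getBytes();
        long stored = checksum(record);                 // written with the record
        boolean healthy = checksum(record) == stored;   // verified on read
        System.out.println("record healthy: " + healthy);
    }
}
```
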

– why use an hbase
— random reads on disk are slow (a ~10 ms seek costs about as much time as reading ~1 mb sequentially); sequential reads are the only way to go at scale
— simple fetch queries are roughly equivalent to an hbase lookup

– hdfs / hbase division?

– how to update record?
— hbase is not replacing relational dbs; they are used in conjunction.
— they can replace relational dbs if the data we’re storing is normalized by nature, e.g. we’re just using it for user records
— if the data in the hbase is actually normalized, the update is straightforward. if it’s denormalized, we’re better off keeping a normalized copy in a relational db, updating that, and then updating the hbase in a batch process later (see the sketch below)
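
a hedged sketch of that batch process: the relational db stays the system of record, and rows touched since the last run are re-written into the denormalized hbase table. the jdbc url, schema, and hbase layout are all made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SyncUsersToHBase {
    public static void main(String[] args) throws Exception {
        long lastRunMillis = Long.parseLong(args[0]);

        // 1. pull rows changed since the last batch run from the
        //    normalized relational db (the system of record)
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/app", "app", "secret");
        PreparedStatement stmt = db.prepareStatement(
                "SELECT id, email, plan FROM users WHERE updated_at > ?");
        stmt.setTimestamp(1, new java.sql.Timestamp(lastRunMillis));

        // 2. overwrite the corresponding denormalized rows in hbase
        try (org.apache.hadoop.hbase.client.Connection hconn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table users = hconn.getTable(TableName.valueOf("users"));
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                Put put = new Put(Bytes.toBytes(rs.getString("id")));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                              Bytes.toBytes(rs.getString("email")));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("plan"),
                              Bytes.toBytes(rs.getString("plan")));
                users.put(put);
            }
        }
        db.close();
    }
}
```
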

– memcache vs hbase

– db sharding
— painful because the sharding logic lives in the application, and it defeats the joins relational dbs are optimized for (the sketch below shows what that looks like)
— hbase is optimized for sharding: regions split and migrate between servers automatically
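
for contrast, a sketch of what application-level sharding looks like: the app hashes the key to pick a database, every query must route through that logic, and re-sharding means rewriting it and migrating data by hand. the shard urls are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ShardRouter {
    private static final String[] SHARD_URLS = {
        "jdbc:mysql://db0.example.com/app",
        "jdbc:mysql://db1.example.com/app",
        "jdbc:mysql://db2.example.com/app",
    };

    // every query for this user must route through here; joins across
    // shards are impossible, and adding a shard changes the mapping
    static Connection connectionFor(String userId) throws SQLException {
        int shard = Math.floorMod(userId.hashCode(), SHARD_URLS.length);
        return DriverManager.getConnection(SHARD_URLS[shard], "app", "secret");
    }
}
```
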