cloudera basic training > the hadoop ecosystem

ref: http://www.cloudera.com/hadoop-training-ecosystem-tour
– google origins
— mapreduce -> hadoop mapreduce
— gfs -> hdfs
— sawzall -> hive, pig (log data warehouses)
— bigtable -> hbase
— chubby -> zookeeper (distributed lock/coordination service)
– pig
— “tables” are directories in hadoop
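a minimal sketch of that point via pig's embedded java api (PigServer); the hdfs paths, alias names, and schema are made up for illustration:

```java
// pig's "tables" are hdfs directories: LOAD reads every file under a
// directory, and STORE writes a directory of part files.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigDirectoryExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // "logs" is whatever files live under the hdfs directory /data/logs
        pig.registerQuery("logs = LOAD '/data/logs' AS (user:chararray, url:chararray);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(logs);");
        // the output "table" is likewise just a directory of part files
        pig.store("counts", "/data/log_counts");
    }
}
```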
– hive
— uses subset of sql instead of pig latin
— not good for serving realtime queries
— a jdbc interface for hive exists (sketch below)
— pig and hive exercises on cloudera vm
— features for analyzing very large data sets
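a minimal sketch of querying hive over jdbc; the host/port and the `logs` table are assumptions, and the driver class and url scheme vary by hive version (this is the hiveserver2 form):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // hypothetical hive server at localhost:10000, default database
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        // hiveql is a sql subset; the query compiles to mapreduce jobs,
        // which is why hive isn't suited to realtime serving
        ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) FROM logs GROUP BY url");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}
```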

– hbase
— column-store database based on bigtable
— holds extremely large datasets
— still very young relative to hadoop
— uses hdfs
— fast single-element access
— only supports single-row transactions
— transactions block reads
— recent writes are held in memory and written as logs to hdfs. limited because hdfs doesn’t support append (yet)
— each row is input to mapreduce
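a minimal sketch of the single-row access pattern against the classic HTable api; the table name, column family, and row key are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSingleRow {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "metrics"); // hypothetical table

        // a put is atomic for its row, the only transaction unit hbase offers
        Put put = new Put(Bytes.toBytes("row-42"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes("7"));
        table.put(put);

        // point lookups by row key are the fast path
        Get get = new Get(Bytes.toBytes("row-42"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count"))));

        table.close();
    }
}
```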

– zookeeper
— uses a paxos-like consensus protocol (zab)
— a distributed consensus engine
— zookeeper may be the method for creating a high-availability namenode
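a minimal sketch of the coordination primitive that idea rests on: an ephemeral znode only one client can create, which vanishes if its owner dies so a standby can take over. the znode path and ensemble address are assumptions:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) { /* no-op for this sketch */ }
        });
        try {
            // ephemeral: held only while this session is alive
            zk.create("/leader-demo", "host-a".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("acquired leadership");
            // a real leader would keep the session open here
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("another node is already leader");
        }
        zk.close();
    }
}
```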

– fuse-dfs
— lets you mount hdfs via linux fuse
— not an alternative file server
— good for easy access to cluster
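a minimal sketch of what easy access buys you: once hdfs is mounted via fuse (the mount point and file path are assumptions), plain file i/o works against the cluster with no hdfs api:

```java
import java.io.BufferedReader;
import java.io.FileReader;

public class ReadFromFuseMount {
    public static void main(String[] args) throws Exception {
        // /mnt/hdfs is a hypothetical fuse-dfs mount point
        BufferedReader in = new BufferedReader(
                new FileReader("/mnt/hdfs/user/demo/part-00000"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}
```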

– hypertable
— competitor to hbase
— used by baidu (chinese search engine)

– kosmosfs
– sqoop
— imports data from relational databases into hadoop
– chukwa
— hadoop log aggregation

– scribe
— general log aggregation

– mahout
— machine learning library

– cassandra
— column store database on a p2p backend

– dumbo
— python library for writing hadoop streaming jobs

hadoop summit 09 > applications track > lightning talks

emi
– hadoop is for throughput (performance), not low latency (speed)
– use activerecord or hibernate for rapid, iterative web dev
– few businesses write raw mapreduce jobs -> use cascading instead
– emi is a ruby shop
– I2P
— feed + pipe script + processing node
— written in a ruby dsl
— can run on a single node or in a cluster
— all data is pushed into S3, which is great because it’s super cheap
— stack: aws > ec2 + s3 > conductor + processing node + processing center > spring + hadoop > admin + cascading > ruby-based dsl > zookeeper > jms > rest
— deployment via chef
— simple ui (built by engineers, no designer involved)
– cascading supports dsls

– “i help computers learn languages”
– higher accuracy can be achieved using a dependency syntax tree, but this is expensive to produce
– the expectation-maximization algorithm is a cheaper alternative
– easy to parallelize, but not a natural fit for map-reduce (see the sketch after this list)
— map-reduce job overhead can become a bottleneck
– 15x speed-up using hadoop on 50 processors
– allowing 5% of data to be dropped results in a 22x speed-up w/ no loss in accuracy
– a more complex algorithm, not more data, resulted in better accuracy
– bayesian estimation w/ bilingual pairs, a more complex algo, reaches 62% accuracy with only 8,000 sentences (after a week of computation!)
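not from the talk, but a minimal sketch of how one em iteration can be phrased as a mapreduce pass: mappers emit fractional (expected) counts per parameter, a reducer sums them, and the m-step renormalizes between jobs; one job per iteration is exactly where the overhead mentioned above bites. uses the classic org.apache.hadoop.mapred api, with the real posterior computation stubbed out:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class EmStep {
    // e-step: one map call per sentence pair; emit expected counts under
    // the current model parameters (stubbed to a uniform posterior here)
    public static class ExpectMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, DoubleWritable> {
        public void map(LongWritable offset, Text sentencePair,
                        OutputCollector<Text, DoubleWritable> out,
                        Reporter reporter) throws IOException {
            String[] tokens = sentencePair.toString().split("\\s+");
            for (String t : tokens) {
                double expectedCount = 1.0 / tokens.length; // placeholder posterior
                out.collect(new Text(t), new DoubleWritable(expectedCount));
            }
        }
    }

    // sum the fractional counts; the driver renormalizes them into new
    // parameters (the m-step) before launching the next iteration's job
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        public void reduce(Text key, Iterator<DoubleWritable> values,
                           OutputCollector<Text, DoubleWritable> out,
                           Reporter reporter) throws IOException {
            double sum = 0.0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new DoubleWritable(sum));
        }
    }
}
```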