notes from Cloud Expo 2009: Christophe Bisciglia on “Working with Big Data and Hadoop”

– machines are reliable
– machines are unique or identifiable
– a data set should fit on one machine

– it’s not a database
— it doesn’t serve data in real-time
— it augments existing DBs
— it does enable deeper analysis that would normally slow a relational DB
– leverages commodity hardware for big data & analytics
– cloudera does for hadoop what redhat does for linux

– fox
— hat ppl are watching on set-top obxes
– autodesk
– D.E.Shaw
— analyze financial data
– mailtrust
— use hadoop to process mail logs and generate indexes that suport staff can use to make adhoc queries

– scientific and experimental data
– storage
— multiple machines are req’d to store the amount of data we’re interested in
— replication protects data from failure
— data is also 3 times as available

– allows for processing data locally
– allows for jobs to fail and be restarted

hadoop’s fault tolerance
– handled at software level

using hadoop
– map-reduce
— natively written in java
— map-reduce can be written in any language
– hive
— provides sql interface
– pig
— high level lang for ad-hoc analysis
— imperative lang
— great for researchers and techinical prod. managers

high performance DB and analytics.  when is it time for hadoop
– in general
— generation rate exceeds load capacity
— performance/cost considerations
— workloads that impede performance
– db
— 1000s of transactions per second
— many concurrent queries
— read/write
— many tables
— structured data
— high-end machines
— annual fees
– hadoop
— append only update pattern
— arbitrary keys
— unstructured or structured data
— commodity hardware
— free, open source

– traditional: web server –> db –> oracle –> biz analytics
– hadoop: web server –> db –> hadop –> oracle –> biz analytics

– data storage costs drops every year
– hadoop removes bottlenecks; use the right tool for the job
– makes biz intel apps smarter

– cloudera’s distro for hadoop
– cloudera desktop

notes from cloudera basic training > installing the cloudera hadoop distro locally and on ec2

installing cloudera distro on a cluster
– motivation: hadoop is complocated to install
– cloudera uses Alternatives to manage a|b testing
– cloudera has created a “configurator” that will generate an rpm customized to your cluster
— generates the configuration files and a custom installer. Each can optionally be used together or separately
– alternatively, you can install an unconfigured distro
– for large scale deployment, use puppet, bcfg2, cfengine, etc. to manage the cluster
— cloudera’s tool can still be used to generate config scripts
– storing data in ebs takes advantage of locality and is much faster than s3
— ebs is more performant than normal hard drives

cloudera basic training > the hadoop ecosystem

– google origins
— mapreduce -> hadoop mapreduce
— gfs -> hdfs
— sawzall -> hive,pig (log data wherehouses)
— bigtable -> hbase
— chubby -> zookeeper (distributed block store)
– pig
— “tables” are directories in hadoop
– hive
— uses subset of sql instead of pig latin
— not good for serving realtime queries
— jdbc interface for hive exists
— pig and hive exercises on cloudera vm
— features for analyzing very large data sets

– hbase
— column-store database based on bigtable
— holds extremely large datasets
— still very young relative to hadoop
— uses hdfs
— fast single-element access
— only supports single-row transactions
— transactions block reads
— all data stored in memory. updates are written as logs to hdfs. limited because hadoop doesn’t have append (yet)
— each row is input to mapreduce

– zookeeper
— uses paxos(?) algorithm
— a distributed consensus engine
— zookeeper may be the method for creating a high-availability namenode

– fuse-dfs
— lets you mount hdfs via linux fuse
— not an alternative file server
— good for easy access to cluster

– hypertable
— competitor to hbase
— used by bidu (chinese search engine)

– kosmosfs
– sqoop
– chukwa
— hadoop log aggregation

– scribe
— general log aggregation

– mahout
— machine learning library

– cassandra
— column store database on a p2p backend

– dumbo
— python library for streaming

cloudera training > mapreduce and hdfs > hdfs


– redundant storage of massive amounts of data on unreliable computers
– advantages over existing file system:
— handles much bigger data sets
— different workload and design priorities
– it’s conceptually comparable (very roughly) to zip file structure
– assumptions
— high component failure rate
— modest number (~1000) of huge (100mb) files
— files are write-once and then appended to
— large streaming reads, instead of seeks
— disks are really good at streaming, but bad at seeking
— high sustained throughput > low latency
– design decisions
— files stored as blocks
— block replication is asynch (this is why there is no updates)
— reliability through replication
— single master (namenode)
— a simple architecture, but also a single point of failure
— no data caching
— data nodes periodically heartbeat w/ the namenode
— creating a file flow: start transaction > define metadata > end transaction
— intermediate files are written locally to mapper, and then reducers fetch that data
– based on gfs architecture
— all data fetched over http
– metadata
— single namenode stores all metadata in memory
— two data structures on disk
— a snapshot of metadata
— a log of changes since snapshot
— the “secondary namenode”, which has a terrible name (should be something like “namenode helper”), updates the snapshot and informs namenode of new snapshot
— namenode snapshot should be written to an nfs-mounted location, so if the namenode fails, the snapshot will survive
— google has optimized linux kernel for gfs, but cloudera just uses x3(?), and others use xfs
— datanodes store opaque file contents in “block” objects on underlying local filesystem
– conclusions
— tolerates failure
— interface is customized for the job, but familiar to developers
— reliably stores terabytes and petabytes of data