barcamp san diego 5: “hbase, cassandra, bigtable, simpledb discussion”

– amazon dynamo
– cassandra
— conflict resolution: the write with the latest timestamp wins
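
what “latest timestamp wins” looks like in code — a minimal sketch, assuming a dynamo/cassandra-style timestamped value (the class and field names here are illustrative, not cassandra’s actual api):

```java
// illustrative last-write-wins reconciliation for a timestamped value
public class TimestampedValue {
    final byte[] value;
    final long timestampMicros; // writer-supplied timestamp

    TimestampedValue(byte[] value, long timestampMicros) {
        this.value = value;
        this.timestampMicros = timestampMicros;
    }

    // when two replicas disagree, keep the copy with the later timestamp
    static TimestampedValue reconcile(TimestampedValue a, TimestampedValue b) {
        return a.timestampMicros >= b.timestampMicros ? a : b;
    }
}
```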

– managing distributed records
— use checksums to verify that replicated data is still intact
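
a minimal sketch of that kind of checksum verification, along the lines of what hdfs does per block with crc32 (the class and method names here are made up):

```java
import java.util.zip.CRC32;

public class RecordChecksum {
    // compute a crc32 over the record bytes; store it alongside the record
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // on read (or during a background scan), recompute and compare against
    // the checksum recorded at write time
    static boolean isHealthy(byte[] data, long expectedChecksum) {
        return checksum(data) == expectedChecksum;
    }
}
```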

– why use hbase?
— random reads on disk are slow; reading sequential data off disk is the only way to go at scale
— a simple fetch query against a relational db is roughly equivalent to an hbase lookup
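
roughly what that lookup looks like with the hbase java client — a sketch only; api details vary by hbase version, and the table/column names are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UserLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");            // hypothetical table
        Get get = new Get(Bytes.toBytes("user:12345"));      // row key
        get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
        Result result = table.get(get);
        byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
        System.out.println(Bytes.toString(email));
        table.close();
    }
}
```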

– hdfs / hbase division?

– how to update a record?
— hbase is not replacing relational dbs; they are used in conjunction.
— they can replace relational dbs if the data we’re storing is normalized by nature, e.g. we’re just using it for user records
— if the data is actually stored normalized in hbase, the update is straightforward (see the sketch below). if the data is denormalized in hbase, we’re better off keeping the normalized copy in a relational db, updating that first, and then updating hbase in a batch process later.
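
the “straightforward” case — overwriting one row’s column directly. a sketch with made-up table/column names; the hbase client api differs a bit between versions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UserUpdate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");               // hypothetical table
        Put put = new Put(Bytes.toBytes("user:12345"));          // row key
        put.add(Bytes.toBytes("info"), Bytes.toBytes("email"),   // add() in older hbase, addColumn() in newer
                Bytes.toBytes("new-address@example.com"));
        table.put(put);  // the newer timestamp simply shadows the old cell
        table.close();
    }
}
```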

– memcache vs hbase

– db sharding
— painful because the shard routing lives in application logic, and it breaks the cross-shard joins that relational dbs are optimized for (see the sketch below)
— hbase, by contrast, is built to shard automatically (tables split into regions across servers)
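
a minimal sketch of what “sharding as application logic” means — the app itself picks the shard; the jdbc urls here are made up:

```java
import java.util.Arrays;
import java.util.List;

public class ShardRouter {
    private final List<String> shardJdbcUrls;

    ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    // hash the key onto one of n shards; any query that spans shards
    // (joins, aggregates) now has to be stitched together in application code
    String shardFor(String userId) {
        int shard = (userId.hashCode() & 0x7fffffff) % shardJdbcUrls.size();
        return shardJdbcUrls.get(shard);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(Arrays.asList(
                "jdbc:mysql://db0/app", "jdbc:mysql://db1/app"));
        System.out.println(router.shardFor("user:12345"));
    }
}
```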

silicon valley hadoop user group 5-20-09: cloudera on automatic database import w/ sqoop

motivation
- hadoop is great for unstructured data
- hadoop is not great for structured data
- how do we glue structured data from mysql onto the unstructured data already in hadoop?

DBInputFormat
- uses jdbc to connect to the db (see the combined sketch after DBWritable below)

DBWritable
- a bridge from a jdbc ResultSet to the mapper's value type
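
a sketch of that bridge plus the DBInputFormat wiring, using hadoop's mapreduce.lib.db classes; the table, columns, and connection string are placeholders:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// a row type that DBInputFormat materializes from a jdbc ResultSet and hands
// to the mapper as its value
public class EmployeeRecord implements Writable, DBWritable {
    long id;
    String name;

    // DBWritable: populate fields from one row of the result set
    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong("id");
        name = rs.getString("name");
    }

    // DBWritable: used when writing records back to the database
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setLong(1, id);
        stmt.setString(2, name);
    }

    // Writable: hadoop's own serialization, for moving records between tasks
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        name = in.readUTF();
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(name);
    }

    // job wiring: point DBInputFormat at a table over jdbc
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/mydb", "user", "password");
        Job job = new Job(conf, "db import");
        job.setInputFormatClass(DBInputFormat.class);
        DBInputFormat.setInput(job, EmployeeRecord.class,
                "employees", null /* WHERE */, "id" /* ORDER BY */,
                "id", "name");
        // ... set mapper/reducer and output path, then job.waitForCompletion(true)
    }
}
```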

Sqoop
- SQL-to-Hadoop
- jdbc-based interface
- automatic datatype mapping (sql column types to java field types)
- uses mapreduce to read tables from db
- imports into hdfs and generates a java class for the imported table
- easy to import into hive
- serialized output is comma-separated

silicon valley hadoop user group 5-20-09: ibm research on hadoop over gpfs

- tested on jbot
- equivalent performance between hdfs and gpfs for non-trivial applications
- used Bonnie for filesys benchmarking
- cluster topology
-- standard hadoop uses local storage
--- cheap, scalable
-- full san uses central store
--- configurability of compute nodes
--- not as scalable
-- sub-cluster uses split storage
- conclusions
-- abstraction of filesys from mapreduce was good
-- gpfs (and other cluster filesys) can match performance of hdfs
- scalability?
-- gpfs runs on thousands of nodes
- fault tolerance?
-- not tested yet
- how similar is gpfs to unix filesys?
-- consistency issues are handled in a similar way