notes from Cloud Expo 2009: Christophe Bisciglia on “Working with Big Data and Hadoop”

fallacies
– machines are reliable
– machines are unique or identifiable
– a data set should fit on one machine

hadoop
– it’s not a database
— it doesn’t serve data in real-time
— it augments existing DBs
— it does enable deeper analysis that would normally bog down a relational DB
– leverages commodity hardware for big data & analytics
– cloudera does for hadoop what redhat does for linux

examples
– fox
— what people are watching on set-top boxes
– autodesk
– D.E.Shaw
— analyze financial data
– mailtrust
— use hadoop to process mail logs and generate indexes that support staff can use to make ad-hoc queries

data
– scientific and experimental data
– storage
— multiple machines are req’d to store the amount of data we’re interested in
— replication protects data from failure
— with 3x replication, data is also more available: any of three machines can serve it

map-reduce
– allows for processing data locally
– allows for jobs to fail and be restarted
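
a minimal in-memory sketch of the programming model (my illustration, not hadoop's actual api). the point is that each key is reduced independently, which is what lets hadoop run tasks on the machines that hold the data and simply re-run any task that fails:

from collections import defaultdict

def map_fn(line):
    # emit (word, 1) for every word on a line
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # sum all the counts for a single key
    return word, sum(counts)

def map_reduce(lines):
    # "shuffle": group mapped values by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # each key is reduced independently, so a failed task
    # can be re-run without touching the others
    return [reduce_fn(k, v) for k, v in groups.items()]

print(map_reduce(["big data", "big hadoop"]))
# [('big', 2), ('data', 1), ('hadoop', 1)]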

hadoop’s fault tolerance
– handled at software level

using hadoop
– map-reduce
— natively written in java
— map-reduce jobs can be written in any language (see the streaming sketch after this list)
– hive
— provides sql interface
– pig
— high level lang for ad-hoc analysis
— imperative lang
— great for researchers and technical product managers
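
to make "any language" concrete: hadoop streaming runs any executable that reads stdin and writes tab-separated key/value lines to stdout, and the framework sorts mapper output by key before it reaches the reducer. a minimal word-count sketch in python (assuming well-formed input):

#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word seen on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so all counts for a
# given word are contiguous and a single running total suffices
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))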

high-performance DBs and analytics: when is it time for hadoop?
– in general
— generation rate exceeds load capacity
— performance/cost considerations
— analytic workloads that impede transactional performance
– db
— 1000s of transactions per second
— many concurrent queries
— read/write
— many tables
— structured data
— high-end machines
— annual fees
– hadoop
— append-only update pattern
— arbitrary keys
— unstructured or structured data
— commodity hardware
— free, open source

arch
– traditional: web server –> db –> oracle –> biz analytics
– hadoop: web server –> db –> hadoop –> oracle –> biz analytics

cost
– data storage cost drops every year
– hadoop removes bottlenecks; use the right tool for the job
– makes biz intel apps smarter

tools
– cloudera’s distro for hadoop
– cloudera desktop

notes from Cloud Expo 2009: Surendra Reddy’s presentation on “Walking Through Cloud Serving at Yahoo!”

open cloud access protocol (opencap)
– definition, deployment, and life cycle mgmt of cloud resources
– allocation, provisioning, and metering of cloud resources
– metadata/registry for cloud resources
– virtual infrastructure
– why ietf?  they are brilliant folks.  we’re not attaching any vendor-specific details
– open source implementation planned
– structure
— resource model
— all infrastructure
— nodes, networks, etc
— resource properties
— modeled as json objects (see the sketch after this list)
— standard catalog of attributes, extensible
— resource operations
— operation, control, etc.
— management notification services
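
to make "modeled as json objects" concrete, a hypothetical resource description in that spirit; every field name below is my invention, not anything from an actual opencap draft:

import json

# hypothetical opencap-style resource: a standard catalog of
# attributes plus an extensible section (all names invented)
node = {
    "type": "node",
    "id": "node-0042",
    "properties": {
        "cpu_cores": 8,
        "memory_gb": 16,
        "network": "net-frontend",
    },
    "extensions": {
        "vendor.example/rack": "r12",  # vendor-specific, yet still plain json
    },
}

print(json.dumps(node, indent=2))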

– smtp is simple.  protocols must be simple
– traffic server has bindings built in (or vice versa?)
– open cloud consortium
— a national testbed to bring clouds together

Notes from Cloud Expo 2009: Raghu Ramakrishnan’s talk on the Yahoo! cloud: “key challenges in cloud computing … and the yahoo! approach”

raghu ramakrishnan
– a triumphant preso
– “key challenges in cloud computing … and the y! approach”

this is a watershed time. we’ve spent lots of time building packaged software; now we’re moving to the cloud

key challenges
– elastic scaling
– availability
— if the cloud goes down, everyone is hosed. consistency or performance must be traded for availability.
– handling failures
— if things go wrong, what can the developer count on when things come up?
– operational efficiency
— cloud managers are db admins for 1000s of clients
– the right abstractions

yahoo’s cloud
– the cloud is an ecosystem. it’s bigger than a single component. all the pieces must work together seamlessly.

data management in the cloud
– how to make sense of the many options
– what are you trying to do?
– oltp vs olap
– oltp
— random access to a few records
— read-heavy vs write-heavy
– olap
— scan access to a large number of records
— by rows vs columns vs unstructured
– storage
— common features
— managed service, rest apis (see the sketch after this list)
— replication
— global footprint
— sherpa
— mobstor
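
a rough sketch of what "managed service, rest apis" implies in practice: records addressed by url, read and written over http. the host, paths, and record shape below are hypothetical, not sherpa's or mobstor's real endpoints:

import json
import urllib.request

# hypothetical endpoint for a record-oriented storage service
BASE = "http://storage.example.com/v1/tables/profiles/records"

def get_record(key):
    # GET the record at its url; the body is a json object
    with urllib.request.urlopen("%s/%s" % (BASE, key)) as resp:
        return json.loads(resp.read())

def put_record(key, record):
    # PUT a json body back to the same url
    req = urllib.request.Request(
        "%s/%s" % (BASE, key),
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)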

y! storage problem
– small records, 100kb or less
– structured records, lots of fields
– extreme data scale

typical applications
– user logins and profiles
— single-record transactions suffice
– events
— alerts, social network activity
— ad clicks
– app-specific data
— postings to message boards
— uploaded photos and tags

vlsd (very large scale distributed) data serving stores
– scale based on partitioning data across machines (see the sketch after this list)
– range selections
— requests span machines
– availability
– replication
– durability
— is it required?
– how is data stored on a single machine?
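
a toy sketch of the two partitioning schemes (hashed vs ordered tables) and of why a range selection can span machines; the machine names and key boundaries are made up:

import bisect
import hashlib

MACHINES = ["m0", "m1", "m2", "m3"]

def hash_partition(key):
    # hashed table: uniform spread, but no efficient range scans
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return MACHINES[h % len(MACHINES)]

# ordered table: each machine owns a contiguous key range
BOUNDARIES = ["g", "n", "t"]  # m0: < "g", m1: ["g","n"), m2: ["n","t"), m3: >= "t"

def range_partition(key):
    return MACHINES[bisect.bisect_right(BOUNDARIES, key)]

def range_scan(start, end):
    # a range selection may touch several machines: fan the request out
    first = bisect.bisect_right(BOUNDARIES, start)
    last = bisect.bisect_right(BOUNDARIES, end)
    return MACHINES[first:last + 1]

print(range_scan("alice", "harry"))  # ['m0', 'm1'] -- spans two machines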

the cap theorem
– consistency vs availability vs partition tolerance
– consistency => serializability

approaches to cap
– use a single version of a db w/ deferred reconciliation
– defer transaction commit
– eventual consistency eg dynamo
– restrict transactions eg sharded mysql
– object timelines, eg sherpa
– ref: julianbrowne.com/article/viewer/brewers-cap-theorem

single slide hadoop primer
– hadoop is write optimized, not ideal for serving

out there in the world
– oltp
— oracle, mysql
— write optimized: cassandra
— main-mem: memcached

ways of using hadoop
– data workloads -> olap -> pig for row ops, zebra for column ops, map reduce for others

hadoop based apps
– we own the terasort benchmark

sherpa
– parallel db
– geo replication
– structured, flexible schemas
– hashed and ordered tables
– components
— req -> routers -> (record location looked up, if necessary) -> lookup cached -> individual machine
– raghu is awesome (“And then!”, sprinting through dense slides)
– write-ahead logging
– asynch replication
— why? we’re doing geo-replication, and the physics involved rules out synchronous updates
— supposing an earthquake hits and ca falls into the ocean, two users can continue to update their profiles
– consistency model
— acid requires synchronous updates
— eventual consistency works
— is there any middle ground?
— sherpa follows a timeline of changes achieved through a standard per-record primary copy protocol
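
a toy sketch of that per-record timeline: one replica acts as the record's master and alone assigns version numbers, so every replica applies the same ordered history, possibly with a lag. an illustration of the concept only, not sherpa's actual protocol (and replication here is eager where sherpa's is asynchronous):

class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # updates may arrive late, but always in timeline order
        assert version == self.version + 1
        self.version, self.value = version, value

class Record:
    def __init__(self, replicas):
        self.master = replicas[0]  # the record's primary copy
        self.replicas = replicas
        self.next_version = 1

    def write(self, value):
        # only the master stamps versions, giving one timeline per record
        v = self.next_version
        self.next_version += 1
        for r in self.replicas:
            r.apply(v, value)

rec = Record([Replica(), Replica()])
rec.write("status: hello")
rec.write("status: world")
print(rec.replicas[1].version, rec.replicas[1].value)  # 2 status: world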

operability
– the cloud allows us to operate at scale
– tablet splitting and balancing
– automatic transfer of mastership

comparing systems
– main point: all of this needs to be thought through and handled automatically

example
– sherpa, oracle, mysql work well for oltp

benchmark tiers
– cluster performance
– replication
– scale out
– availability
– we’d like to make this a group effort, in keeping w/ our philosophy

the integrated cloud
– big idea: declarative lang for specifying the structure of a service
– key insight: multi-env
– central mechanism: the integrated cloud
– surendra will talk about this

foundation components
– how to describe app
– desc for resources, entrypoints, bindings, etc

yst handled 16.4 million uniques for mj death news

acm socc
– acm symposium on cloud computing