Praise for the blobstore 📯

The phrase “blobstore” refers to a simple key-value store, optimized for large values. Amazon’s S3Google Cloud Storage, App Engine’s Blobstore and Twitter’s Blobstore are examples.

I’m writing to highlight what appears to be a fundamental role played by the blobstore pattern. This may be aspirational, but I’m curious to explore.

Caveat: I only have experience using blobstores and the related infra I mention below, eg CDNs, source control, document stores, CI, etc, not maintaining this infra.

I’ve bolded references to other key pieces of infrastructure.

Hosting

This is one of the more primitive, intuitive applications. Push a local file to the store and then retrieve it:

$ curl -T my_file https://blobs.example.com/my_file
$ curl https://blobs.example.com/my_file

Note how amenable the store is to a RESTful API. Related, many of the connections made in this post remind me of CouchDB.

Custom domains

Enable custom domains and we have general hosting:

$ curl -H "x-custom-domain: mydomain.com" https://blobs.example.com/my_file
$ curl https://mydomain.com/my_file

Artifact store

Push executable blobs into the store and we have an artifact store, like Twitter’s Packer and Google’s MPM.

CDN

Put a cache in front of the store and we have a CDN. Use SSDs for hosting and we may not need a cache. Or use an external CDN and the blobstore as an “origin server”.

Key-value

Key-value stores scale well because they require relatively little coordination: a key maps simply to a host and doesn’t need to be read before being written.

We improve hosting scalability by building on a key-value pattern, for example.

Document store

For better or worse, a blobstore is only aware of simple keys, so we can’t read into the values.

However, index the value and we have a document store (eg AWS DocumentDB). We can keep the blobstore pure by defining the document store as standalone infrastructure and depending on an event bus (eg Google Cloud Pub/Sub) and execution service (eg Google Cloud Functions):

  1. Document store subscribes to blobstore push events via event bus associating an executable
  2. Client pushes value to blobstore
  3. Blobstore writes value and fires event, eg “push ”
  4. Document store processes the path as described above

Now we can query values:

$ curl -d '{"k":"v"}' https://blobs.example.com/my_file
$ curl https://documents.example.com/my_file/?query="k>t"
> v

Search

Using an approach similar to building a document store, we can process the values stored in blobstore to populate a search index.

Source control

Put a review process and commit log in front of writes to the blobstore and you have scalable source controlUse a blobstore for Git’s object store, and you get scalable replication. Use a document-aware source control mechanism, like Google, and you get a scalable mono-repo.

Blobstore data is often immutable and versioned, so commits could simply be references to file versions.

CI

Subscribe to push events from source control, using an event bus and execution service and we have CI. We can listen for success events from our subscriber to get pre-submit quality control. We can write artifacts back to the blobstore as part of a continuous deploy pipeline.

Batch

We’ve already described a few cases of stream processing via an event bus, but we can also periodically iterate over values and stream them through an execution service to rebuild indices, produce artifacts, normalize data, etc.

hadoop summit 09 > applications track > lightning talks

emi
– hadoop is for performance, not speed
– use activerecord or hibernate for rapid, iterative web dev
– few businesses write map reduce jobs –> use cascading instead
– emi is a ruby shop
– I2P
— feed + pipe script + processing node
— written in a ruby dsl
— can run on a single node or in a cluster
— all data is pushed into S3, which is great cause it’s super cheap
— stack: aws > ec2 + s3 > conductor + processing node + processing center > spring + hadoop > admin + cascading > ruby-based dsl > zookeeper > jms > rest
— deployment via chef
— simple ui (built by engineers, no designer involved)
– cascading supports dsls
– “i helpig ciomputers learn languages
– higher accuracy can be achieved using a dependency syntax tree, but this is expensive to produce
– the expectation-maximum algorithm is a cheaper alternative
– easy to parallelize, but not a natural fit for map-reduce
— map-reduce overhead can become a bottleneck
– 15x speed-up using hadoop on 50 processors
– allowing 5% of data to be dropped results in a 22x speed-up w/ no loss in accuracy
– a more complex algorithm, not more data, resulted in better accuracy
– bayesian estimation w/ bilingual pairs, a more complex algo, with 8000 only sentences results in 62% accuracy (after a week of calculation!)