Batch to SSTable

A pattern I’ve seen a couple of times for immutable data:

  1. Generate the data using a batch process
  2. Store the data in an indexed structure (like an SSTable)
  3. Expose the structure through an API

The result is a key-value store with extremely high read performance.
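As a sketch of what step 2 can look like (hypothetical names and file format, not any particular system’s): a batch job writes keys in sorted order to a single file and keeps a sparse in-memory index of byte offsets, so a read is one seek plus a short forward scan.

    import bisect

    SPARSE_EVERY = 100  # index every 100th key

    def write_sstable(path, items):
        """Write pre-sorted (key, value) string pairs; return a sparse index."""
        index_keys, index_offsets = [], []
        with open(path, "w", encoding="utf-8") as f:
            for i, (key, value) in enumerate(items):
                if i % SPARSE_EVERY == 0:
                    index_keys.append(key)
                    index_offsets.append(f.tell())
                f.write(f"{key}\t{value}\n")
        return index_keys, index_offsets

    def read_sstable(path, index, key):
        """Seek to the nearest indexed key at or before `key`, then scan."""
        index_keys, index_offsets = index
        pos = bisect.bisect_right(index_keys, key) - 1
        if pos < 0:
            return None  # key sorts before the first key in the file
        with open(path, "r", encoding="utf-8") as f:
            f.seek(index_offsets[pos])
            for line in f:
                k, _, v = line.rstrip("\n").partition("\t")
                if k == key:
                    return v
                if k > key:
                    return None  # sorted order: we passed where it would be
        return None

Because the file is immutable, there is no locking or compaction to worry about, and the API layer can cache aggressively, which is where the read performance comes from.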

I first heard about this in connection with Twitter’s Manhattan database. Recently, I saw the pattern again at a different company. Ilya Grigorik wrote about it several years ago in the context of log-structured data, BigTable, and LevelDB.

My takeaway is that this pattern is worth considering if:

  • my current store is having issues (no need to fix what’s not broken)
  • I have heavy read traffic
  • I can tolerate latency on updates

The log-structured context makes me think this pattern might open the door to write access too. Twitter’s post mentions a “heavy read, light write” use case, although it also describes using a B-tree structure rather than a simple sorted file for that case. Grigorik’s post mentions that BigTable uses a “memtable” to facilitate writes.
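Continuing the sketch above (still hypothetical, and reusing write_sstable and read_sstable): a memtable is just an in-memory buffer for writes that gets sorted and flushed as a new SSTable when it fills up; reads check the memtable first, then the SSTables from newest to oldest.

    class MemtableStore:
        """Rough sketch of memtable-backed writes over immutable SSTables."""

        def __init__(self, flush_threshold=10_000):
            self.memtable = {}        # recent writes, held in memory
            self.flush_threshold = flush_threshold
            self.sstables = []        # [(path, index)], oldest first

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_threshold:
                self.flush()

        def flush(self):
            # Sort the buffered writes and persist them as a new segment.
            path = f"segment-{len(self.sstables)}.sst"
            index = write_sstable(path, sorted(self.memtable.items()))
            self.sstables.append((path, index))
            self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for path, index in reversed(self.sstables):  # newest segment wins
                value = read_sstable(path, index, key)
                if value is not None:
                    return value
            return None

A real log-structured store would also log the memtable to disk for durability, handle deletes with tombstones, and compact old segments; this is just the shape of the idea.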

Note that the Web’s IndexedDB has a similar access pattern to an SSTable. If I think of remote updates as infrequent writes, then the pattern described here might be a common use case on the Web, which might bring this full circle: Google crawls the Web in a batch process and updates an index that is read-heavy.