Batch to SSTable

A pattern I’ve seen a couple times for immutable data:

  1. Generate the data using a batch process
  2. Store the data in an an indexed structure (like SSTable)
  3. Expose the structure through an API

The result is a key-value store with extremely high read performance.

The first time I heard about this was Twitter’s Manhattan database. Recently, I saw the pattern again at a different company. Ilya Grigorik wrote about it several years ago in the context of log-structured data, BigTable and LevelDB.

My takeaway is: this pattern is worth considering if:

  • my current store is having issues (no need to fix what’s not broken)
  • I have heavy read traffic
  • I can tolerate latency on updates

The context of log-structured makes me think that might open a door to write access too. Twitter’s post mentions a “heavy read, light write” use-case, although it also describes use of a B-tree structure rather then a simple sorted file for that case. Grigorik’s post mentions BigTable uses a “memtable” to facilitate writes.

Note Web’s IndexedDB has a similar access pattern to SSTable. If I think about remote updates as an infrequent write, then the pattern described here might be a common use-case for Web, which might bring this around full circle: Google crawls the Web in a batch process and updates an index which is read-heavy.

Object path

Problem

I want to reduce conditional assignment when setting nested keys in an object, ideally:

{a:{b:{c:value}}} = set(a/b/c, value)

This is handy for data manipulation and abstracting path-based tools like LevelDB and Firebase Realtime Database.

Solution

Use object-path or lodash’s set/get.

Note: the tools mentioned above interpret numeric path segments as array indices, which may cause unexpected results when inserting arbitrary values, eg:

set(store, 'users.5.name', 'Kwan') // store.users.length --> 6

If this is an issue, consider:

function set(obj, path, val){
  path.split('/').reduce((parent, key, i, keys) => {
    if (typeof parent[key] != 'object') {
      if (i === keys.length - 1) {
        parent[key] = val
      } else {
        parent[key] = {}
      }
    }
    return parent[key]
  }, obj)
}
function get(obj, path){
  return path.split('/').reduce((parent, key) => {
    return typeof parent === 'object' ? parent[key] : undefined
  }, obj)
}

Examples

Inverting an object:

const posts = {1: {tags: {sports: true, news: true}}, 2: {tags: {news: true}}}
const byTag = {}
Object.entries(posts).forEach(([id, post]) => {
  Object.keys(post.tags).forEach(tag => {
    set(byTag, `${tag}/${id}`, true)
  })
})
// byTag --> { sports: { '1': true }, news: { '1': true, '2': true } }

Creating and querying a prefix tree:

const flatten = require('flat')

// populate tree
const emojis = {
  '🙂': 'smile',
  '😀': 'grinning',
  '😁': 'grin'
}
const tree = {}
Object.entries(emojis).forEach(([emoji, name]) => {
  let path = name.split('').join('/') + '/' + emoji
  set(tree, path, true)
})

// lookup prefix
const prefix = 'g'
const path = prefix.split('').join('/')
const subtree = get(tree, path) || {}
const matches = Object.entries(flatten(subtree)).map(([key, val]) => {
  return key.slice(-2)
})
console.log(matches) // --> ["😀", "😁"]