A pattern for reducing ambiguity

Here’s the pattern:

  1. Identify the owner
  2. Have the owner describe the problem and solution in a doc
  3. Invite stakeholders to comment on the doc
  4. If discussions on the doc linger, schedule a meeting to finalize
  5. Document conclusions in the doc, to facilitate communication and closure, and for future reference

This may seem obvious, but I often forget it, especially in cases where a task starts small and grows in complexity. A problem may seem too small to formally “own”. A solution may seem too trivial to document. Stakeholders may attend a meeting without context. A meeting may conclude with stakeholders feeling like they’ve expressed themselves, but there’s no actionable plan to resolve the problem.

Identifying one person from a group of stakeholders to own the project and be responsible for leading work to completion reduces organizational ambiguity.

Documenting the problem and proposed solution in writing reduces ambiguity by capturing ideas from a variety of mediums in a single, relatively objective place that stakeholders can comment on.

Documentation alone may achieve closure, but it may also spawn extensive commentary. Meetings are relatively expensive, but scheduling a meeting to drive closure reduces ambiguity by distilling commentary on the document into conclusions.

Documenting conclusions reduces ambiguity by rendering them in an objective form all stakeholders can agree on.

A few symptoms that indicate when this pattern might be useful:

  1. Endless back-and-forth in chat, bug comments, etc., which can give the impression of progress but never resolves the issue
  2. Multiple and/or cross-functional stakeholders, which can obscure priorities
  3. Multiple people opining on a solution and/or answering questions, which can obscure ownership
  4. A problem that drags on, which can indicate it’s important, but inappropriately owned

This ties into larger discussions around project planning (e.g., managing planned vs unplanned work), and meeting efficiency (e.g., inviting stakeholders, assigning pre-work and clarifying outcomes), but the point here is just to succinctly identify an organizational pattern and when it can be helpful.

Batch to SSTable

A pattern I’ve seen a couple times for immutable data:

  1. Generate the data using a batch process
  2. Store the data in an indexed structure (like an SSTable)
  3. Expose the structure through an API

The result is a key-value store with extremely high read performance.
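A minimal sketch of the idea in JavaScript (assuming the batch output fits in memory; real SSTables are on-disk files with sparse indexes): sort once at build time, then serve reads with binary search.

```javascript
// Build an immutable, sorted index from batch output (sorted once, never mutated).
function buildTable(records) {
  return [...records].sort((a, b) => a.key < b.key ? -1 : a.key > b.key ? 1 : 0)
}

// Serve reads with binary search over the sorted entries.
function get(table, key) {
  let lo = 0, hi = table.length - 1
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    if (table[mid].key === key) return table[mid].value
    if (table[mid].key < key) lo = mid + 1
    else hi = mid - 1
  }
  return undefined
}

const table = buildTable([
  { key: 'b', value: 2 },
  { key: 'a', value: 1 },
  { key: 'c', value: 3 }
])
console.log(get(table, 'b')) // --> 2
```

Because the structure is immutable, reads need no locks and the file can be memory-mapped or cached aggressively, which is where the extreme read performance comes from.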

The first time I heard about this was Twitter’s Manhattan database. Recently, I saw the pattern again at a different company. Ilya Grigorik wrote about it several years ago in the context of log-structured data, BigTable and LevelDB.

My takeaway: this pattern is worth considering if:

  • my current store is having issues (no need to fix what’s not broken)
  • I have heavy read traffic
  • I can tolerate latency on updates

The context of log-structured data makes me think this might open a door to write access too. Twitter’s post mentions a “heavy read, light write” use case, although it also describes using a B-tree structure rather than a simple sorted file for that case. Grigorik’s post mentions that BigTable uses a “memtable” to facilitate writes.

Note that the Web’s IndexedDB has a similar access pattern to an SSTable. If I think of remote updates as infrequent writes, then the pattern described here might be a common use case for the Web, which brings this full circle: Google crawls the Web in a batch process and updates an index that is read-heavy.

Easy & advanced

An API design pattern I’ve found helpful is to think about usage in two modes: easy and advanced.

This is especially helpful in debates. We may be able to accommodate a valid, but advanced, feature without cluttering the API, by housing it in an advanced subset of the API and docs.

I recently heard another phrasing of the same idea from David Poll: “Common case easy. Uncommon case possible.”
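A sketch of what this can look like in an API surface (all names here are hypothetical): the easy mode is a call with sensible defaults, and the advanced mode is an optional options object that most callers never touch.

```javascript
// Hypothetical client: the common case needs no configuration.
function createClient(url, options = {}) {
  const {
    timeoutMs = 5000,     // advanced: override the default timeout
    retries = 3,          // advanced: tune retry behavior
    transport = 'https'   // advanced: swap the transport entirely
  } = options
  return { url, timeoutMs, retries, transport }
}

// Easy: common case, zero ceremony.
const client = createClient('https://example.com')

// Advanced: uncommon case, still possible.
const tuned = createClient('https://example.com', { retries: 0, timeoutMs: 100 })
```

The options object keeps the advanced surface out of the easy call's way, so adding an advanced knob later doesn't break or clutter the common case.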

A nice data mart 🏪

The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team

https://en.m.wikipedia.org/wiki/Data_mart

I don’t have a lot of experience with data marts, but I recently met one that seems nice and simple.

The store benefits from a few other abstractions:

  1. a service that just ingests and persists client events
  2. a query abstraction, like Hive
  3. trustworthy authentication and list membership infra

Given these, the store in question simplifies the process of utilizing data by abstracting a few common requirements:

  1. a simple config DSL specifies which query to run, the frequency to run it, the output table, deletion conditions, etc. Specifying config via files enables use of common source control tools.
  2. three predefined processing stages (raw-to-normalized, normalized-to-problem-specific, problem-specific-to-view-specific). New event sources, aggregations and views can be independently defined by adding new config files.
  3. common styling and libraries for data visualization
  4. access is generalized to a few tiers of increasing restriction, eg team, division, company. The lowest level might be freely granted to teams for their own business intelligence, and the highest level restricted to executives for making revenue-specific decisions.

In retrospect, this seems pretty straightforward. I’m remembering a tool from another team (basically Hadoop + Rails + D3) that had the same goals, but didn’t have the query, scheduling or ACL abstractions underneath. It was replaced by an external tool that was terrible to the point of being unusable, but more secure. Eventually, we dumped normalized data in a columnar store that was also secure and easier to use for our team’s business intelligence, but would’ve been insufficient for things like periodically updating charts. I guess it’s the combination of data store features and supporting infra that makes the magic happen.

Entropy

A colleague once relayed to me someone else’s observation that every syntax variation allowed by a language will eventually appear in a code base. Resisting this drift toward everything that’s possible requires energy. The idea that “naming things is hard” seems like a variation of this. If I could remember the originator, I’d call it ___’s Law. In the meantime, I think “entropy” is the general form.

With its Greek prefix en-, meaning “within”, and the trop- root here meaning “change”, entropy basically means “change within (a closed system)”

https://www.merriam-webster.com/dictionary/entropy

In this context, static analysis tools like linters help limit what’s possible.

An organizational approach I’ve seen a couple times is to embrace the range of possibility. For example, given a camp in favor of Java and another in favor of Scala, a former team avoided endless debate by supporting both until there was an obvious reason not to. Another example is Google Cloud’s reconciliation of REST and gRPC:

All our Cloud APIs expose a simple JSON REST interface that you can call directly or via our client libraries. Some of our latest generation of APIs also provide an RPC interface that lets clients make calls to the API using gRPC: many of our client libraries use this to provide even better performance when you use these APIs

https://cloud.google.com/apis/docs/overview#multiple-surfaces-rest-and-grpc

Another organizational strategy David Poll brilliantly described: products will express the org structure that created them (Conway’s Law); we can expend energy resisting this, eg review processes, and/or we can create orgs in the shape of the products we intend.

Better together SDK pattern

I’m a fan of an SDK product pattern I’ve heard people call “better together”. The idea is for SDKs to be decoupled, but complementary.

An example is an SDK that needs telemetry. One approach would be to add telemetry to the SDK, but this has a few problems: bloat, opacity, redundancy and coupling. An app may already have a telemetry SDK installed, so bundling another with an unrelated SDK bloats the app. Data logged inside the SDK is opaque to the app, which also complicates any SDK billing story. If the SDK does want to export telemetry data, it will need to build telemetry-specific logic redundant to the app’s telemetry provider. Any telemetry logic built by the SDK is coupled to the SDK.

The better-together pattern provides an alternative. To continue with the example above, an SDK requiring telemetry could detect whether a telemetry provider is installed and publish events to it. A simplistic example would be to let the app hand the SDK a telemetry provider, e.g. via its constructor:

class SDK {
  constructor(telemetry = null) {
    this.telemetry = telemetry
  }
  // …
  sayHi() {
    if (this.telemetry) {
      this.telemetry.logEvent('said_hi')
    }
  }
}
// …
const telemetry = new Telemetry()
const sdk = new SDK(telemetry)
sdk.sayHi()

With this approach telemetry is only included in the app if the app owner wants it, minimizing bloat. Telemetry from the SDK is visible alongside the app’s other telemetry. The SDK can focus on whatever it does best. Telemetry is reusable elsewhere in the app.

One potential downside with this pattern concerns differentiating “internal” use-cases. Continuing with the telemetry example, the SDK may want to log events that are unrelated to the app’s functionality. I’ve seen three approaches: don’t differentiate, differentiate throughout, or don’t use the better-together pattern. The first approach treated all data as belonging to the app and namespaced all events published by the SDK, which worked well. The second approach was expensive due to technical complexity and eventually discontinued. The third approach was expensive due to redundant staffing, infra, UX, etc, but necessary so long as some parties don’t buy into the better-together pattern. I guess this stresses the “together” part of better-together 🙂
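A sketch of the first approach (all names hypothetical): wrap the app's telemetry provider and prefix every event the SDK publishes, so SDK events stay distinguishable without any special-casing in the pipeline.

```javascript
// The SDK wraps the app's telemetry provider and namespaces its own events.
class NamespacedTelemetry {
  constructor(telemetry, namespace) {
    this.telemetry = telemetry
    this.namespace = namespace
  }
  logEvent(name) {
    // 'sdk.said_hi' sits alongside app events like 'checkout_complete'
    this.telemetry.logEvent(`${this.namespace}.${name}`)
  }
}

const events = []
const appTelemetry = { logEvent: (name) => events.push(name) }
const sdkTelemetry = new NamespacedTelemetry(appTelemetry, 'sdk')
sdkTelemetry.logEvent('said_hi')
console.log(events) // --> ['sdk.said_hi']
```

All data still flows through the app's provider and shows up in the app's dashboards; the namespace is the only concession to "internal" use.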

View

The joy of top-down rendering.

Problem

I want to present data, ideally as view = render(data).

Solution

I really like the view mechanics provided by choo/yo-yo/bel.

const html = require('bel')
const nanobus = require('nanobus')
const yo = require('yo-yo')

const bus = nanobus()
const render = yo.update.bind(yo, document.body)
const emit = bus.emit.bind(bus)

bus.on('change', (name) => {
  const state = {}
  state.name = name.toUpperCase()
  render(view(state, emit))
})

// render once at startup so the input exists before any 'change' events fire
render(view({ name: '' }, emit))

function view(state, emit){
  return html`
    <body>
      Hello, <input value="${state.name}" placeholder="name" onkeyup=${onKeyUp}>
    </body>
  `
  function onKeyUp(e){
    emit('change', e.target.value)
  }
}

Object path

Problem

I want to reduce conditional assignment when setting nested keys in an object, ideally:

{a:{b:{c:value}}} = set(a/b/c, value)

This is handy for data manipulation and abstracting path-based tools like LevelDB and Firebase Realtime Database.

Solution

Use object-path or lodash’s set/get.

Note: the tools mentioned above interpret numeric path segments as array indices, which may cause unexpected results when inserting arbitrary values, eg:

set(store, 'users.5.name', 'Kwan') // store.users.length --> 6

If this is an issue, consider:

function set(obj, path, val){
  path.split('/').reduce((parent, key, i, keys) => {
    if (i === keys.length - 1) {
      // always assign at the leaf, even if an object already lives there
      parent[key] = val
    } else if (typeof parent[key] !== 'object' || parent[key] === null) {
      parent[key] = {}
    }
    return parent[key]
  }, obj)
}
function get(obj, path){
  return path.split('/').reduce((parent, key) => {
    return parent && typeof parent === 'object' ? parent[key] : undefined
  }, obj)
}

Examples

Inverting an object:

const posts = {1: {tags: {sports: true, news: true}}, 2: {tags: {news: true}}}
const byTag = {}
Object.entries(posts).forEach(([id, post]) => {
  Object.keys(post.tags).forEach(tag => {
    set(byTag, `${tag}/${id}`, true)
  })
})
// byTag --> { sports: { '1': true }, news: { '1': true, '2': true } }

Creating and querying a prefix tree:

const flatten = require('flat')

// populate tree
const emojis = {
  '🙂': 'smile',
  '😀': 'grinning',
  '😁': 'grin'
}
const tree = {}
Object.entries(emojis).forEach(([emoji, name]) => {
  let path = name.split('').join('/') + '/' + emoji
  set(tree, path, true)
})

// lookup prefix
const prefix = 'g'
const path = prefix.split('').join('/')
const subtree = get(tree, path) || {}
const matches = Object.entries(flatten(subtree)).map(([key, val]) => {
  return key.slice(-2)
})
console.log(matches) // --> ["😀", "😁"]

Client-side stream processing

Solution

Given a bus and store:

struct Post {
  let id: String
  var text: String
  var likeState: Bool
}
protocol State {}
struct RootState : State {
  var userId: String? = nil
  var posts: [String:Post] = [:]
}
protocol Renderable {
  func render(_ state: State)
}
// assumed interface to a key-value store that answers via Value events on the bus
protocol Store {
  func get(_ key: String)
  func set(_ key: String, _ val: Any)
}
struct PostsImpression: Event {}
struct LikeRequested: Event {
  let postId: String
  let likeState: Bool
}
// emitted by the store when a value arrives for a key
struct Value: Event {
  let key: String
  let val: Any
}
class Reducer : Subscriber {
  let store: Store
  let controller: Renderable
  var state: RootState
  init(store: Store, controller: Renderable, state: RootState){
    self.store = store
    self.controller = controller
    self.state = state
  }
  func onEvent(event: Event){
    switch event {
    case _ as PostsImpression:
      store.get("posts/\(state.userId!)")
      store.get("likes/\(state.userId!)")
    case let event as LikeRequested:
      store.set("likes/\(state.userId!)/\(event.postId)", event.likeState)
    case let event as Value where event.key.hasPrefix("likes"):
      let postId = event.key.components(separatedBy: "/").last!
      let likeState = event.val as! Bool
      state.posts[postId]?.likeState = likeState
      controller.render(state)
    case let event as Value where event.key.hasPrefix("posts"):
      let post = Post(
        id: event.key.components(separatedBy: "/").last!,
        text: event.val as! String,
        likeState: false) 
      state.posts[post.id] = post
      controller.render(state)
    default:
      break
    }
  }
}

Context

Redux’s reducer inspired me to think about this. Kleppmann’s blog post on turning the database inside out inspired me to think about stream processing in general.

Problem

Consolidate event processing from UI and data streams.

Praise for the humble bus 🚌

Context

This is a stream-of-consciousness gush for a pattern I like. I start by stating some things I like followed by a pattern that produces these things and then attempt to state the problem being solved (in case other folks like me appreciate a problem statement).

I’m a fan of the unidirectional event flow first brought to my attention by React/Redux. Prakhar mentioned this is also called the yo-yo pattern (events bubble up, views render down). yo-yo.js provides a delightfully simple implementation. choo completes the yo-yo pattern by building on yo-yo.js and injecting an event bus into the view renderer.

Slightly related, I’m also enamored by the notion of an append-only log, reverently described by Jay Kreps and Martin Kleppmann in The Log and Turning the database inside-out with Apache Samza, respectively. Kleppmann provides additional, wonderful context in Data Intensive Applications.

In my experience, event logging from a client can be tricky to maintain. A couple helpful patterns: enable stdout-logging close to the event source, and explicitly enumerate events.

Solution

In this context, I’ve developed a deep appreciation for the simple pubsub pattern, and the notion of an "event bus" through which published events flow to subscribers. Although buses and logs (and indices) frequently appear together, the bus seems the most primitive.

This pattern is nothing new, but here’s a simplistic implementation I find easy to reason about:

protocol Event {}
struct LikeEvent : Event {}
protocol Subscriber {
  func onEvent(event: Event)
}
class StdoutSubscriber : Subscriber {
  func onEvent(event: Event) {
    print(event)
  }
}
class Bus {
  var subscribers: [String:Subscriber] = [:]
  func sub(_ subscriber: Subscriber){
    self.subscribers[key(subscriber)] = subscriber
  }
  func unsub(subscriber: Subscriber){
    self.subscribers[key(subscriber)] = nil
  }
  func pub(_ event: Event){
    for subscriber in subscribers.values {
      subscriber.onEvent(event: event)
    }
  }
  func key(_ subscriber: Subscriber) -> String {
    return String(describing: type(of: subscriber))
  }
}
let bus = Bus()
bus.sub(StdoutSubscriber())
// ... on "like" button tap
bus.pub(LikeEvent())

Events are first-class in Node, so an easy equivalent to the above would be:

var EventEmitter = require('events')
var bus = new EventEmitter()
function stdoutSubscriber(event){
  console.log(`event=${event}`)
}
bus.on('event', stdoutSubscriber)
bus.emit('event', 'like')

Problem

Given all the above, I think the problem I find the bus solving is: reduce complexity in a distributed system by allowing event sources to publish, and event processors to subscribe, as plainly as possible.

Caveat

I think decoupling event production from processing does have a cost. We lose locality, which complicates reasoning. In cases where production and consumption can be colocated, e.g., async operations on a thread that’s safe to block (Finagle’s use of Scala’s composable futures is a great example), I think colocating is worth considering.

Related

Node’s event emitter supports the notion of a "channel". Kafka calls them "topics". This concept reminds me of Objective-C’s KVO and Firebase’s Realtime Database, which let me subscribe to the stream of changes for a given "key" (or "path").