PlayFab’s LiveOps guide

My experience is largely in features and infrastructure for growing and retaining users, aka “growth”. Recently, I learned the games industry has a comparable concept, “LiveOps”. I’ve found value in using the latter to learn more about the space in general.

PlayFab has an excellent guide to LiveOps. The guide is a brief and accessible reference, so I’ll just jot notes below.


The guide summarizes LiveOps as “Games are shifting from one-off experiences to services that evolve. Developers of successful live games focus on understanding their players, meeting their individual needs, and cultivating long-term relationships”

In growth-speak, I’d phrase this as analytics, personalization and retention. There is some direct association with growth: “We’re investing in games that people play for longer and engage with much more deeply … to drive growth …“

I guess the “live” in LiveOps refers to “services that evolve”: “… live services represented nearly 58% of Electronic Arts’ net revenue …” I can see how this would be a big shift from encoding all logic in a released binary: “save client updates for entirely new features or large assets”

“With a LiveOps game, the real work starts with launch instead of ending there” I think there’s less of a distinction in a non-game app; most apps already pull content from a network.

A summary of LiveOps features:

  • “server-side configurations …”
  • “… content data untethered from client versions …”
  • “… in-depth analytics”

“Content data” refers to “… new experiences and content, thereby extending the lifetime of our games“, which explains the claim that LiveOps can reduce up-front investment.

“… the ‘live’ part of LiveOps goes through three post-launch stages:

  1. Iterating your Game …
  2. Personalizing the Player Experience …
  3. Managing the Community …”

I think all of these apply to apps in general.

I like how the breakdown also indicates infra and talent required in each step:

  1. Iteration requires “build, test-and-deploy pipeline, basic analytics, and content configurations“
  2. Personalization requires “data analysts and product managers to use more sophisticated tools such as recommendation systems and live events managers“
  3. Community management requires “customer support staff, marketing, and community managers … guild systems, user-generated content, and multiplayer services for matchmaking, cross-network play, and communications.”

The guide presents these as sequential steps of maturity. In my experience with growth, 1 and 3 came before 2, since generating per-user state was relatively resource intensive. Also, we could start with relatively naive approaches to 2 and 3, eg friend recommendations by a static topic like “sports”, and then layer on more sophisticated alternatives, eg per-user behavioral predictions.

Connecting to people

LiveOps has a user-centric perspective: “LiveOps developers know that players and communities evolve. When creating a game, we’re not movie directors with a singular vision, but more like TV network program managers … LiveOps games are player-centric and react to player desires and needs …”

I’m a fan of a customer-centric perspective. Differentiating user-centric seems like it should be obvious, but it’s nice to see it emphasized.

My recent experience is in growth as a service, which is why I differentiate “users” from “customers”. (Customers would be apps that have users/players.)


“With LiveOps, acquisition is an ongoing process” I guess this recognizes that people may come and go from a game, although in the terminology I’m familiar with, returning would be “resurrection” or “reactivation”. (“Reactivation” is listed later as an example of acquisition.)

I appreciate the list of common acquisition sources:

  • Store discovery
  • Public relations
  • Advertising
  • Cross-promotion
  • Influencer marketing
  • Social installs, eg shares
  • Reactivation

Helpful tip: “Track player engagement and retention based on source of acquisition and look for trends” Platforms providing acquisition channels should also provide attribution, eg Google Play Store’s campaign attribution passed to Android’s Install Referrer API.

Kind of obvious, but the guide recommends A/B testing reactivation inducements. Later the guide simply recommends testing everything all the time.


Retention is “one of the only data-supported ways to know if players enjoy playing“

Common techniques for increasing retention:

  • Adding content
  • Stickiness due to investment – this comes up later in the “conversion” section
  • Social connections
  • Compelling game mechanics, eg Go has “simple rules that allow for endless new strategies”

Helpful tip: “Try to communicate only what’s interesting and valuable, and mix up rewards so they don’t become background noise” I’ve heard this phrased as “fatigue”. Messaging platforms should provide features to help customers avoid fatiguing users.


The definition of “engagement” or “active” usage is often mysterious to me, so I appreciate the general description: “Active communities engage with a game by playing, providing feedback, and promoting (discussing online or in person, creating fan content, and so on) … Common reporting period intervals include 1-day, 7-day, and 30-day.” An arbitrary post from SurveyMonkey has some context for MAU.

Interesting: “engagement is the only KPI that some studios measure.“

Another relatively obvious tip: “Look at how studios with games like yours engage their community as a baseline for your own engagement efforts”, ie “competitive analysis”. But still, as a general primer, I appreciate the comprehensiveness.


“Your team needs the tools to isolate and identify problems, fix or escalate them … and communicate with players throughout the process.” 👍

Common tools:

  • External-facing ticketing, so internal and external actors can coordinate
  • “ability to look up individual player profiles and make manual changes”. Ideally, a customer can do this themselves.
  • “A way to suspend or ban players”
  • “A way for players to upload crash logs”. Seems this could be automatic, eg Crashlytics
  • “Ways to send messages to players” (and customers)

Good tip: “Changes in support contact and resolution rates (e.g. number of support tickets opened and closed) can indicate larger issues.”



I like the list of common metrics:

  • “ARPU (Average Revenue Per User) … general business health”. I’m guessing percentiles would be good too
  • “ARPPU (Average Revenue Per Paying User) … for gauging monetization strategies, such as store design or DLC”
  • (from the monetization section) “Paying rate is just as important as ARPPU for measuring monetization”. I get the impression paying rate refers to the percentage of users who pay for anything
  • “Unique Logins … indicates newly acquired players”
  • “Conversion Rate … success at converting free players into paid players.”
  • “Retention … how well your game keeps players interested”
  • “Average Session Length … how long your gameplay loop stays fun”
  • “Session Frequency … how often players engage with the game”
  • “LTV (Lifetime Value) … the total number of unique players divided by total revenue generated”. I assume this is meant the other way around, ie total revenue divided by unique players
  • “Errors … how stable your game is”
  • “Content Logs … popularity, stability, and engagement of specific game content”. This seems relatively game-specific, but I guess it could be generalized to feature-specific metrics
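
To check my understanding of the revenue metrics, here is a minimal Java sketch. The class name, method names and the input shape (a map of user id to revenue in a reporting period) are my own assumptions, not anything from the guide or PlayFab’s API.

```java
import java.util.Map;

// Hypothetical helper: computes revenue metrics from per-user revenue
// for a single reporting period. Users with 0.0 revenue are non-payers.
class RevenueMetrics {
    // ARPU: total revenue divided by all users in the period.
    static double arpu(Map<String, Double> revenueByUser) {
        if (revenueByUser.isEmpty()) return 0.0;
        return totalRevenue(revenueByUser) / revenueByUser.size();
    }

    // ARPPU: total revenue divided by only the users who paid anything.
    static double arppu(Map<String, Double> revenueByUser) {
        long payers = payerCount(revenueByUser);
        return payers == 0 ? 0.0 : totalRevenue(revenueByUser) / payers;
    }

    // Paying rate: fraction of users who paid anything.
    static double payingRate(Map<String, Double> revenueByUser) {
        if (revenueByUser.isEmpty()) return 0.0;
        return (double) payerCount(revenueByUser) / revenueByUser.size();
    }

    private static double totalRevenue(Map<String, Double> revenueByUser) {
        return revenueByUser.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    private static long payerCount(Map<String, Double> revenueByUser) {
        return revenueByUser.values().stream().filter(r -> r > 0).count();
    }
}
```

With these definitions, ARPU equals ARPPU multiplied by paying rate, which may be why the guide treats paying rate as being as important as ARPPU.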

Good point: “Some metrics are best reviewed over long periods of time (e.g. Avg. Revenue), while others benefit from constant real-time updates (e.g. Errors)” And this may change over time, eg crash rates while changing a flag value.

Interesting: “Instead of boosting acquisition through marketing or app store advertising, they built traction by focusing on early retention metrics such as daily active users, session length, and crashes“ and “direct player feedback”

Good idea: “implementing direct player feedback through a public Trello community board,  letting users log bugs directly, and holding community votes on what to work on next.“

Good point: “Knowing your retention rate is important, but offers no insight on how to fix it. For that, you need to do a deep drill-down or segment your audience and experiment.”


“Segmenting groups is a necessary step to deliver the best content to the most players”

Good tip: “your analytics toolset should let you define custom segments”

Common use-cases:

  • “Designers segment players based on in-game behavior to understand their needs and develop player-centric content” presumably to increase retention
  • “Monetization teams use segments to understand spending patterns, identify fraudulent behavior, and predict revenue”
  • “Marketers create custom segments and optimize messaging for each to acquire or engage players”

“The most important thing about the testing aspect is the cohort and the segmentation …” 🤔 I guess an example would be identifying a low spending segment to test a feature to increase spending, as opposed to testing it on everyone, some of whom may already be spending the max.

A basic funnel:

  • New users
  • Non-spenders
  • Spenders
  • High spenders

“Once you [define a funnel like this], it’s easy to track your progress getting players to move through the funnel from one segment to the next.”
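
A basic funnel like this could be sketched as a classifier plus a count per segment. This is only an illustration: the thresholds (7 days, 100 currency units) and all names are placeholders I made up, not values from the guide.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical funnel segmentation; thresholds are placeholders.
class SpendFunnel {
    enum Segment { NEW_USER, NON_SPENDER, SPENDER, HIGH_SPENDER }

    static final int NEW_USER_MAX_DAYS = 7;         // assumed cutoff
    static final double HIGH_SPENDER_MIN = 100.0;   // assumed cutoff

    // Spend takes priority over account age, so a new user who
    // already paid counts as a spender.
    static Segment classify(int daysSinceInstall, double lifetimeSpend) {
        if (lifetimeSpend >= HIGH_SPENDER_MIN) return Segment.HIGH_SPENDER;
        if (lifetimeSpend > 0) return Segment.SPENDER;
        if (daysSinceInstall <= NEW_USER_MAX_DAYS) return Segment.NEW_USER;
        return Segment.NON_SPENDER;
    }

    // Each player is {daysSinceInstall, lifetimeSpend}.
    static Map<Segment, Integer> countBySegment(List<double[]> players) {
        Map<Segment, Integer> counts = new EnumMap<>(Segment.class);
        for (Segment s : Segment.values()) counts.put(s, 0);
        for (double[] p : players) {
            counts.merge(classify((int) p[0], p[1]), 1, Integer::sum);
        }
        return counts;
    }
}
```

Comparing the counts between two reporting periods would give a rough view of movement through the funnel.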

Good tip: “machine learning can help you automatically segment players”


“Good experiments have a hypothesis or some sort of goal KPI to change” 👍

I’m glad this is stated: “The size of your audience can affect how complex your testing can be. A game with millions of players can easily test subtle changes, but one with a smaller audience will only get significant data from tests with stark variations. The same goes for how many tests you can run simultaneously—a smaller player base means fewer simultaneous tests are statistically reliable.” I’d also say an opinionated approach, direct feedback and/or severely limited test concurrency can be a more efficient guide for a small user base than cluttering code with conditional logic and waiting a long time for significant data. Nice: “monitor user feedback … when player data is in short supply.”
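
To see why a small audience only gets significant data from stark variations, here is a rough sample size sketch. It uses the standard normal approximation for comparing two proportions, with a two-sided 5% significance level and 80% power; those choices and the names are my assumptions, not the guide’s.

```java
// Hypothetical sample size estimate for an A/B test on a rate (eg conversion).
class SampleSize {
    static final double Z_ALPHA = 1.96;    // two-sided 5% significance
    static final double Z_BETA = 0.8416;   // 80% power

    // Approximate number of players needed per variant to detect a
    // change in the rate from p1 to p2.
    static long perVariant(double p1, double p2) {
        double variance = p1 * (1 - p1) + p2 * (1 - p2);
        double delta = p2 - p1;
        double z = Z_ALPHA + Z_BETA;
        return (long) Math.ceil(z * z * variance / (delta * delta));
    }
}
```

Under these assumptions, detecting a subtle change from 10% to 11% takes on the order of 15,000 players per variant, while a stark change from 10% to 15% takes fewer than 1,000.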

Good tip: “be sure the test encompasses at least a whole week to measure fluctuations between weekday and weekend players” and users in different regions.

Interesting: “Make sure if one player sees something different from another, they can clearly understand why” I wonder if an example would be providing UI to list active experiments.

In-game surveys should “only ask one question at a time”

“Failed experiments are an important part of the process to learn” 👍

Best practices:

  • “Learn which metrics best capture performance for your game’s KPIs, and set appropriate periods to monitor and review them”
  • “Test gameplay mechanics early. It’s harder to test changes … after players have developed expectations” Reminds me of changes to Twitter UX basics, like changing the ⭐️ → ❤️
  • “When players have problems, analyze event history …” which implies an ability to collect and analyze such history is important, which may not be obvious before an issue happens
  • “Use limited-time events to test changes to gameplay—players are often more tolerant of gameplay changes when called out as events” Good idea. Reminds me of sports-based features, eg World Cup. I hadn’t thought of them as an opportunity to experiment w basic mechanics.
  • “Chart out the “funnel” progression for players in your game and experiment with ways to motivate players to move through your funnel”
  • “Ensure your analytics tools let you view KPIs by segment”
  • “Establish a clear success metric to gauge the impact of tests”
  • “Test qualitative factors by polling players with in-game surveys”


“It helps to put together a designated LiveOps team” I’ve also seen feature teams own their launches.

This seems like a launch checklist:

  1. Feedback loop and KPIs
  2. Support channels and data access guidelines
  3. Incident response strategy

Soft launch

Example soft launch: “choose a smaller geographic area, ideally with the same language as your core audience … and run your game for a few months” or “limiting your initial audience with an Early Access or Beta period”. EAP and beta are something I have more experience with.

Good idea: “pay close attention to the core engagement metrics” for soft launch

Good idea: “During soft launch, confirm that you can update the game without causing disruption to players – and make sure that you can roll back changes if problems arise during deployment”, ie verify LiveOps infra works as expected.

“Many developers are moving away from soft launches in favor of lean launches“ 🤔… “As a small, indie studio, you don’t have the money to do user acquisition for a soft launch”

Lean launch

A lean launch:

  1. deploys an MVP version of the game
  2. connects with a target audience, and then 
  3. tunes the game based on player data and feedback


  • reliable data pipeline
  • smaller manageable audience without inflated expectations
  • be able to adapt your game quickly

“Collecting and analyzing your crash data and retention metrics is a must”, which is “dependent on an effective LiveOps pipeline that allows for developing several pieces of content at once, and agile deployment”

Best practices

  • “Assemble a LiveOps team”
  • “Develop a calendar” to coordinate live updates post-launch
  • “Put validation checks in place” I guess because this approach is premised on making lots of significant changes, so the cost of failure is high
  • “Rehearse key LiveOps tasks”, which is good advice, but kind of contradicts an earlier statement “There’s no such thing as a dry run in live games”
  • “Ensure your team has a way to roll back changes ”
  • “Set roles and permissions”

Game updates

“Game updates aren’t limited to new levels or game mechanics. They can consist of new items for purchase, events, balance patches, bundles, or anything else that encourages a player to come back and play more.”

“Understanding your player base is a key element in designing and delivering relevant updates”

“Frequency and consistency are as important as quality when making updates”

Tip: experiment with time between updates in addition to the update content “to see if they impact engagement or retention.”

“save client updates for entirely new features or large assets … assets such as art and gameplay logic are included in the client, but how those assets are displayed to players is driven by server-side logic … plan your content architecture in advance and move as much of your game logic as possible onto the server or cloud.”
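
One way I picture this split is a client that ships with default values and lets server-provided values override them. The sketch below is my own illustration; the class, the keys and the way config is fetched are all assumptions, not PlayFab’s API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical client-side config: defaults ship in the binary,
// overrides arrive from the server without a client update.
class LiveConfig {
    private final Map<String, String> defaults;
    private final Map<String, String> overrides = new HashMap<>();

    LiveConfig(Map<String, String> defaults) {
        this.defaults = Map.copyOf(defaults);
    }

    // Call with the latest values fetched from the server.
    // Fetching itself is out of scope for this sketch.
    void applyServerConfig(Map<String, String> fetched) {
        overrides.clear();
        overrides.putAll(fetched);
    }

    // Server value wins; otherwise fall back to the shipped default,
    // which also covers the case where the client is offline.
    String get(String key) {
        return overrides.getOrDefault(key, defaults.get(key));
    }
}
```

For example, a client built with `Map.of("store.banner", "default_banner")` would show a different banner as soon as `applyServerConfig` receives a new value for that key.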

Best practices

  • “Make a list of everything in your game that could be considered ‘content’”
  • Plan how content will get to the client 👈
  • “Think about offline mode” 👍
  • “Vary your updates” between temporary and permanent changes
  • “Consider targeting new content to specific player segments”
  • “Consider cloud streaming or downloading assets in the background during gameplay to reduce friction”


“A live event is any temporary but meaningful change to a game’s content”

“Anything can be an event … Timebox it, reward it, there you go …”

“Successful events often include:

  • A measurable goal …
  • A limited-time period …
  • Engaging themes and content …
  • Surprise and predictability …
  • A sense of community effort …
  • An effective means of communicating with players …”

Reminds me of a “campaign” in other applications of targeted content.

Experiment with event frequency: “By experimenting with event timing, they were able to settle on an event schedule that raised their baseline engagement while also minimizing lapsed players”

“Consider running repeatable events … Holidays work because players will be more understanding of temporary changes, and often have more time to play”

“Adding a special, limited-time leaderboard for a specific in-game goal is a common event.”

“Events can also run in parallel”


A calendar can help reduce the complexity of orchestrating events and avoid fatiguing users.


“Great player communication is critical to the success of live events”

Push notifications, email and social media are common channels of event communication.

Best practices

  • “Make a list of everything you might want to change as part of an event”
  • “Prepare to run events from the server, without a client update”
  • “Find natural ways to promote upcoming events in-game”
  • “Capture event data in your data warehouse“ for later analysis and segmentation
  • “Let your team be flexible when creating events” This seems like basic team management; micro-managing is bad
  • “Set goals for events” so we can evaluate performance
  • Maintain a calendar for coordination and to avoid fatiguing users
  • “Use events to experiment with ideas”
  • “Establish an event framework” that separates unique and repeatable aspects of an event


“… every discussion about monetization should consider:

  • The kind of game you’re building …
  • … [aligning] player needs with your revenue goals …
  • Ethical guidelines for monetization …
  • How your competition is monetizing … “


Aka “in app purchases” 👈

Common forms:

  • “Cosmetics are items that affect the physical appearance …”
  • “Account Upgrades are permanent enhancements to a player account …”
  • “Consumables are items that can be used once for a temporary effect …”
  • “VIP Programs are subscription-based programs …”
  • Content Access
  • “Random Boxes (or loot boxes) are items players can purchase without knowing exactly what they’ll receive”

Common “elements of in-game store management:

  • Presentation … should be easy to use …
  • Catalog management … (A good rule of thumb is once a week.) …
  • Pricing …
  • Offers and promotions …
  • Fraud … As soon as you start offering items with real-world currency value, there will be fraud …”

Nice: “Use server-side receipt validation … for added security”


I really like this topic. From the growth perspective, this is part of acquisition.

“two main challenges:

  1. eliminating the barriers to entry
  2. showing your players value”

The first one I’ve come to see as a fundamental product consideration. If we want people to do anything, we need to minimize the cost of doing that thing. I think this also ties into an engineering best-practice: keep migrations and changes separate.

Regarding the second point, I think a great counter-example is a paywall before showing any content. “players have more of a propensity to pay once they have a trust relationship with the game and the developer”

How players spend

I don’t have experience with in-app purchases, so this is all interesting.

“Players will have different levels of spending they are comfortable with”

“It’s easy to get caught up focusing on big spenders or trying to sell as much as possible as soon as the game launches. But those methods are often unreliable, unsustainable, and may reflect poorly on your studio” Reminds me of low-quality ads, which eventually drive users off the platform.

“Build a broader, more reliable, and engaged spending base rather than chasing ‘whales’ … A thousand players paying $10 is preferable to ten players paying $1000 because there is more opportunity for repeat purchases.”


“One of the most popular forms is rewarded video—short videos often promoting a different game or app, watched for an in-game reward or more playtime … [beware] players might be lured away by a competitor’s game.”

“As with almost every other LiveOps effort, you need to continuously test different solutions.”

Good idea: “Many developers segment their audience and only show ads to certain segments, often limiting them to non-paying players.”


“You can usually do an on-the-fly calculation to compare the value per impression of an in-house-ad versus one from an external network, so you can decide what to show for a given player segment.”
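
I imagine the on-the-fly calculation as comparing expected revenue per impression. In this sketch the external ad is valued by its eCPM (revenue per 1,000 impressions) and the in-house ad by conversion rate times value per conversion. The formula for the in-house side and all names are my assumptions.

```java
// Hypothetical comparison of an in-house ad and an external network ad.
class AdChooser {
    // Expected revenue for one impression of an external ad, given eCPM.
    static double externalValue(double ecpm) {
        return ecpm / 1000.0;
    }

    // Expected revenue for one impression of an in-house ad: the chance
    // the player converts times what a conversion is worth to us.
    static double inHouseValue(double conversionRate, double valuePerConversion) {
        return conversionRate * valuePerConversion;
    }

    // Returns which ad to show for a segment with the given estimates.
    static String choose(double ecpm, double conversionRate, double valuePerConversion) {
        return inHouseValue(conversionRate, valuePerConversion) >= externalValue(ecpm)
                ? "in-house"
                : "external";
    }
}
```

The inputs would presumably be estimated per player segment, so the choice could differ between, say, new users and lapsed spenders.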


“Many games use two virtual currencies: a “soft” currency earned in-game, and a “hard” purchased currency”

“Build a matrix of all the sources and sinks for in-game resources and build a model of the economic activity you can adjust in a tool such as Microsoft Excel, without rolling out updates.” I’ve heard of managing config this way.
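
The simplest version of such a model I can think of is a net flow per resource: sum the sources, sum the sinks, and look at the difference. The names and the per-day framing below are my assumptions.

```java
import java.util.Map;

// Hypothetical economy model for one in-game resource.
class EconomyModel {
    // Net amount of the resource a typical player gains per day.
    // Keys name each source or sink; values are amounts per day.
    static double netPerDay(Map<String, Double> sources, Map<String, Double> sinks) {
        double in = sources.values().stream().mapToDouble(Double::doubleValue).sum();
        double out = sinks.values().stream().mapToDouble(Double::doubleValue).sum();
        return in - out;
    }
}
```

A persistently positive net suggests inflation and a persistently negative one suggests players will run dry, so the amounts are what I would expect to tune from config.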

“What we want is sustained investment and signs that a player has really perceived value…”

Best practices

  • Choose a strategy
  • Set ethical and quality guidelines
  • Prevent fraud
  • Simplicity and variety
  • Bundle commonly purchased items
  • Pair sales with events ← this reminds me of the growth practice of requesting feedback when engagement is high
  • Incentivize social sharing
  • Diversify ad networks
  • Keep loss aversion in mind
  • Always be testing “Never stop testing your monetization efforts, because your players’ perception of value (both real-world and in-game) will change over time“


“… detailed documentation on multiplayer architecture at”


“As soon as you add a leaderboard in a game, even if it’s a single-player game, players start seeing progress against other people, and people all of a sudden start engaging more” Makes me think there are mechanics for games based on human behavior comparable to those used by growth features. For example, leaderboards increasing engagement highlights a human response to hierarchy.

Filtering makes leaderboards more fun:

  • Geo
  • Platform
  • Mode, eg player-vs-player
  • Option, eg difficulty
  • Level
  • Statistic, eg # wins
  • Time, eg stats for today

“combining the variables Platform, Level, and Statistic you could create a leaderboard for ‘Fastest time (Statistic) to complete Ventura Highway (Level) by PC players (Platform).’”
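
The quoted example could be sketched as a filter plus a sort. This is a toy in-memory version with names of my own choosing, not how PlayFab implements leaderboards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory leaderboard filtered by platform and level.
class Leaderboard {
    // One recorded completion time; field names are illustrative.
    static final class Entry {
        final String player;
        final String platform;
        final String level;
        final double seconds;

        Entry(String player, String platform, String level, double seconds) {
            this.player = player;
            this.platform = platform;
            this.level = level;
            this.seconds = seconds;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void record(Entry e) {
        entries.add(e);
    }

    // Fastest times (Statistic) for a level (Level) on a platform (Platform),
    // best first, capped at limit entries.
    List<Entry> fastest(String platform, String level, int limit) {
        return entries.stream()
                .filter(e -> e.platform.equals(platform) && e.level.equals(level))
                .sorted(Comparator.comparingDouble((Entry e) -> e.seconds))
                .limit(limit)
                .collect(Collectors.toList());
    }

    // Clears all standings, eg at the start of a new period.
    void reset() {
        entries.clear();
    }
}
```

Other filters from the list, like geo or time, would be more fields on the entry and more conditions in the filter.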

Leaderboards can also encourage social behavior, eg biggest contributor to team

An ability to reset the leaderboard can encourage participation

Award prizes based on achievements shown in the leaderboard.


“Groups … can get players more invested in a game”

Some group dynamics:

  • Communication
  • Game progress
  • Stats 

I wonder if these can be used for other groups, eg a working group.

Interesting: “Determine how short-term groups are formed based on how much players need to trust teammates to succeed … “

“Long-term groups (such as guilds) have been proven to increase player retention …” Seems like a form of “investment” that makes an app stickier. The fact that it was “proven” makes me think there might be papers to read.

“… how do I provide you the best experience not only within your guild, but when your guild is gone… It comes down to matchmaking … the right aspiration together as a group.” Reminds me of work dynamics.

Managing communities

“A dedicated community manager can help keep players satisfied and foster a positive community …” Reminds me of the dev “advocate” role

Some ways to avoid bad behavior:

  • Limiting communication options
  • Filtering words and phrases
  • Defining a code of conduct

“The team behind Guild Wars 2 reportedly built the whole game around the idea that ‘players should always be happy to see one another.’” 🙂

“The more you can provide a framework for people to operate in, the more likely they are …“


“50% or more of online users will only buy when presented offers in their native language.”

Good idea: give the localization team access to edit strings

“Store as much of the in-game text on the server as possible, so it can be easily edited and localized”

Best practices

  • Consider multiplayer early in development
  • Add multiplayer elements whenever possible
  • Experiment with matchmaking algorithms
  • Plan for multiplayer scaling needs
  • Offer multiple ways to communicate
  • Enable customization of groups, to increase engagement
  • Reset leaderboards on a regular basis
  • Award prizes based on leaderboard stats
  • Enable users to “refresh” game to explicitly load new config
  • Localize communications(!) 

Tools and services

The guide lists PlayFab’s API, but I think it’s more interesting as an overview of useful entities and controls:

  • Auth
  • Content
    • Game content
    • User generated content
  • User data
  • Matchmaking
  • Leaderboards
    • Tournaments
    • Reset schedules
    • Prizes
    • Fraud prevention
  • Communication
    • P2p
    • Text and voice with transcription and translation
    • Accessibility (speech to text and vice versa)
  • Eng controls
    • Config
    • Reporting
    • Events
    • Automation
    • Scheduling
  • Product & community controls
    • Reporting
    • Event log
    • User management
    • Automation
    • Scheduling
    • Segmentation
    • Experimentation
    • Messaging
  • Economics controls
    • Stores
    • Sales
    • Economy
    • Fraud prevention

Software ecology

A documentary about an eco-friendly home near Austin inspired me to think about software systems from an ecological perspective.

The notion of a software or product “ecosystem” isn’t new, but I’d previously only thought about it as fostering healthy interactions in a system; I hadn’t considered the non-human actors. Is the code hard to maintain? Are alerts waking people up unnecessarily? Is the business sustainable? Is there a natural order? Is anything out of place, like an old tire in a stream? Can we achieve our goals in harmony with the natural order?

For example, I worked on a free service that would alert when resources were exhausted. Because it was free, it was natural for consumers to deprioritize efficient usage. Maintainers of the service absorbed the cost in the form of routine alerts. A more balanced system would shift some cost to the consumers.

I think the idea of separating concerns is another example. Decoupling can reduce maintenance cost even if the functionality doesn’t change.

A colleague once remarked that every syntax variation allowed by a language would eventually appear in a code base; a convention could not stop this. Perhaps this was another example of an imbalanced natural order. The cost of enforcement was solely on the reviewer. Shifting this cost to programmatic validation, like a linter, would help restore balance.


JShell

TLDR: like other REPLs, JShell provides an easy way to test Java one-liners, and, like the Rails console, a handy ad hoc CLI.

I appreciate a REPL for quickly checking the validity of small snippets. For example, I can improve the quality of my code reviews by verifying an idea works in a REPL before recommending it in a review.

My first exposure to a Java-esque REPL was the Scala REPL, which could also interpret Java. This was handy, but only easily available when Scala was installed.

When Scala wasn’t installed, I used a public website for Java, so I needed to be mindful not to use it for anything confidential, and it could take some time to load.

Recently, I learned about JShell, which is included in the JDK as of version 9.

Aside, in case you’re on a Chromebook, Google’s Cloud Shell is a great ad hoc terminal.

Per the JShell docs, I can start/stop the shell:

$ jshell
|  Welcome to JShell -- Version 11.0.6
|  For an introduction type: /help intro

jshell> /exit
|  Goodbye

Hello world:

jshell> System.out.println("hi")

Note implicit semicolons for simple code. Return statements appear to need explicit semicolons, though, eg:

jshell> String get(){
   ...> return "s"
   ...> }
|  Error:
|  ';' expected
|  return "s"
|            ^

From the docs, I see there are “scratch variables”, which reminds me of the Scala REPL’s res variables:

jshell> 2+2
$3 ==> 4

jshell> $3
$3 ==> 4

jshell> $3 + 2
$5 ==> 6

The feedback is comparable to javac, eg:

$ cat
import java.util.HashMap;
import java.util.Map;

class MyMap {
        static Map<String, String> m = new HashMap<>();
        public void put(String k, String v){
                m.put(k, v);
        }
        public void get(String k){
                return m.get(k);
        }
}

$ javac error: incompatible types: unexpected return value
                return m.get(k);
1 error
$ jshell
|  Welcome to JShell -- Version 11.0.6
|  For an introduction type: /help intro

jshell> /open
|  Error:
|  incompatible types: unexpected return value
|                  return m.get(k);
|                         ^------^

We can use tab completion:

jshell> List
List                 ListIterator         ListResourceBundle


<press tab again to see documentation>

jshell> List.
class     copyOf(   of(

jshell> List.of("1", "2")
$3 ==> [1, 2]

jshell> $

The up arrow scrolls back through history. We can also print it:

jshell> /list

   1 : List.of("1", "2")
   2 : $>System.out.println(n))

We can also search history, eg Ctrl + R for reverse search:

(reverse-i-search)`hi': System.out.println("hi")

Editing multi-line code is cumbersome, eg:

jshell> class Foo {
   ...> String get(){}
   ...> }
|  created class Foo

// Up arrow to edit method definition

jshell> String get(){
   ...> return "s";
   ...> }
|  created method get()

// Creates new function instead of editing Foo.get

jshell> get()
$10 ==> "s"

jshell> Foo f = new Foo();
f ==> Foo@1e88b3c

jshell> f.get();


I’d recommend using an external editor for anything non-trivial.

JShell has an /edit command to launch an external editor, but it doesn’t appear to save the output.

jshell> /set editor vim
|  Editor set to: vim

jshell> class Foo {}
|  created class Foo

jshell> /edit Foo // add bar method to Foo
|  replaced class Foo

jshell> Foo f = new Foo()
f ==> Foo@56ac3a89

|  Error:
|  cannot find symbol
|    symbol:   method bar()
|  ^---^

jshell> /edit Foo // Observe bar method is undefined

I’d recommend just having an editor open in a separate terminal, and using JShell’s /open command to load the file after changes.

For folks using Google Cloud Shell, it appears to have an implicit tmux session, which makes it easy to edit in one pane and use JShell in another.

In practice, I’m guessing there’s little use for JShell when editing complex code, but it does provide a handy CLI for exploring complex code. We could have a build target, like pants repl, or a CLI for our app, like rails console.

For example, given a naive script

javac -d bin src/main/com/example/* \
        && jar cf bin/MyMap.jar -C bin com \
        && jshell --class-path bin/MyMap.jar

We could:

$ ./                                                                                                                                                          
|  Welcome to JShell -- Version 11.0.6
|  For an introduction type: /help intro

jshell> import com.example.MyMap;

jshell> MyMap m = new MyMap();
m ==> com.example.MyMap@6a41eaa2

jshell> m.put("k", "v")

jshell> m.get("k")
$4 ==> "v"

Customer feedback

Customer feedback can come through a variety of channels. Here’s a list of the ones I’ve found valuable, and practices for aggregating feedback through these channels.


Inline survey

This is an in-context prompt for feedback, eg
“Like the service? 😀 | 😐 | 😭”


Pros

  • Relatively easy for feedback provider, resulting in broad participation
  • Confidential


Cons

  • Limited signal
  • Hard to filter for quality participation

User research

This is when a company makes an open call for feedback.


Pros

  • Long-form discussion
  • More control over participation than inline surveys
  • Confidential


Cons

  • Quality of participants can vary

Partner interviews

This is when a company solicits feedback from valuable customers. I’ve found it useful to do this periodically, eg yearly.


Pros

  • Long-form discussion
  • High-quality participants
  • Confidential


Cons

  • Risk of focusing too closely on feedback from a small group

GitHub issues

GitHub provides a way for users of a repository to report issues.


Pros

  • Public, so customers can pile on and follow progress
  • Intuitive for GitHub users
  • Reactions help capture sentiment
  • Easy to discover, eg via Web search
  • Enables customers to crowd-source support


Cons

  • Specific to a repository


Slack

This is when a company maintains a public Slack channel.


Pros:

  • Good for quick Q/A
  • Enables customers to crowd-source support

Cons:

  • Can be noisy/interruptive. A support rotation can help distribute the cost across the team.

Stack Overflow

Stack Overflow enables people to ask questions about anything, but companies can use its features to support customers and collect feedback.


Pros:

  • Easy to discover, eg via Web search
  • Enables customers to crowd-source support


One challenge of having numerous feedback channels is distilling themes, identifying tasks and communicating progress. I’ve seen a few patterns to address this.

Product management

Collecting feedback from customers and using that feedback to guide planning is basic product management. Engineers may end up doing aspects of this, but they should get appropriate credit.

Dedicated support

This is when a company staffs a team to monitor all feedback channels. This team can also aggregate reports and/or work with product management to identify themes.

Support rotation

Distribute the cost of monitoring feedback sources across the team using a weekly rotation. The person on rotation will be completely distracted and may identify issues, so this pairs well with an on-call rotation.

Centralized tracking

This can be a spreadsheet or a full-featured bug tracking tool like Jira, but it’s helpful to have a single place to prioritize all issues and feature requests.

Delegated communication

If there’s a team that, say, owns a given GitHub repository, they can own the task of keeping GitHub issues for that repository synchronized with centralized tracking state.

Crowd-sourcing themes

One pattern I’ve seen work well is to split a team into groups, assign each group a subset of feedback, and ask each group to identify themes in that feedback, for example as part of quarterly planning. This works well with large bodies of inline survey results.

Batch to SSTable

A pattern I’ve seen a couple times for immutable data:

  1. Generate the data using a batch process
  2. Store the data in an indexed structure (like SSTable)
  3. Expose the structure through an API

The result is a key-value store with extremely high read performance.
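To make the three steps concrete, here is a minimal in-memory sketch. It is not how Manhattan or any real SSTable implementation works; `SortedTable`, `batch_job` and the sample records are names and data I made up for illustration, and a real table would live on disk with an index block rather than in Python lists.

```python
import bisect
from typing import Iterable, Optional


class SortedTable:
    """Immutable, sorted key-value table with binary-search lookups."""

    def __init__(self, records: Iterable[tuple[str, str]]):
        # Step 2: sort once at build time. If a key repeats, the last record wins.
        deduped = dict(records)
        self._keys = sorted(deduped)
        self._values = [deduped[k] for k in self._keys]

    def get(self, key: str) -> Optional[str]:
        # Binary search over the sorted keys: O(log n) per read, no locking needed
        # because the table never changes after construction.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start: str, end: str) -> list[tuple[str, str]]:
        # Range reads over [start, end) are cheap because neighbouring keys are adjacent.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


def batch_job() -> list[tuple[str, str]]:
    # Step 1: stand-in for a batch process (hypothetical data, deliberately unsorted).
    return [("user:2", "bob"), ("user:1", "alice"), ("user:3", "carol")]


# Step 3: the "API" here is just the table's read methods.
table = SortedTable(batch_job())
```

In this sketch an update means rerunning the batch job and swapping in a new table, which is where the tolerance for update latency comes from.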

The first time I heard about this was Twitter’s Manhattan database. Recently, I saw the pattern again at a different company. Ilya Grigorik wrote about it several years ago in the context of log-structured data, BigTable and LevelDB.

My takeaway is: this pattern is worth considering if:

  • my current store is having issues (no need to fix what’s not broken)
  • I have heavy read traffic
  • I can tolerate latency on updates

The log-structured context makes me think it might open a door to write access too. Twitter’s post mentions a “heavy read, light write” use-case, although it also describes use of a B-tree structure rather than a simple sorted file for that case. Grigorik’s post mentions BigTable uses a “memtable” to facilitate writes.
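A rough sketch of the memtable idea, as I understand it: writes land in a small mutable map, which is periodically frozen into an immutable sorted segment, and reads check the memtable first and then the segments from newest to oldest. `Segment`, `Store` and `memtable_limit` are hypothetical names, and this skips real-world concerns like write-ahead logging, compaction and deletes.

```python
import bisect
from typing import Optional


class Segment:
    """An immutable sorted run of key-value pairs (a stand-in for an SSTable)."""

    def __init__(self, items: dict[str, str]):
        self.keys = sorted(items)
        self.values = [items[k] for k in self.keys]

    def get(self, key: str) -> Optional[str]:
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


class Store:
    def __init__(self, memtable_limit: int = 2):
        self.memtable: dict[str, str] = {}   # mutable, absorbs writes
        self.segments: list[Segment] = []    # immutable, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Freeze the memtable into a new sorted segment and start a fresh one.
        if self.memtable:
            self.segments.append(Segment(self.memtable))
            self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        # Newest data wins: memtable first, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            value = segment.get(key)
            if value is not None:
                return value
        return None


store = Store(memtable_limit=2)
store.put("a", "1")
store.put("b", "2")   # reaches the limit, so the memtable is flushed to a segment
store.put("a", "3")   # newer value for "a" sits in the memtable
```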

Note Web’s IndexedDB has a similar access pattern to SSTable. If I think about remote updates as an infrequent write, then the pattern described here might be a common use-case for Web, which might bring this around full circle: Google crawls the Web in a batch process and updates an index which is read-heavy.

Easy & advanced

An API design pattern I’ve found helpful is to think about usage in two modes: easy and advanced.

This is especially helpful in debates. We may be able to accommodate a valid, but advanced, feature without cluttering the API, by housing it in an advanced subset of the API and docs.

I recently heard another phrasing of the same idea from David Poll: “Common case easy. Uncommon case possible.”
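As a small illustration of the two modes, here is a sketch of a made-up messaging API. `send`, `send_advanced` and `RetryOptions` are hypothetical names, not from any real library; the point is only that the easy entry point stays tiny while the advanced one carries the knobs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryOptions:
    """Advanced knobs, documented separately from the easy path."""
    attempts: int = 3
    backoff_seconds: float = 0.5


def send_advanced(message: str, channel: str, options: RetryOptions) -> dict:
    # Uncommon case: possible, with every choice made explicitly by the caller.
    return {
        "message": message,
        "channel": channel,
        "attempts": options.attempts,
        "backoff_seconds": options.backoff_seconds,
    }


def send(message: str) -> dict:
    # Common case: one argument, defaults chosen by the API owner.
    return send_advanced(message, channel="default", options=RetryOptions())
```

A debated feature like configurable retries then lands in `RetryOptions` without adding a parameter to `send`.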

Site reliability

Google’s SRE handbook, summarized in The Calculus of Service Availability, and the accompanying Art of SLOs workshop materials, are great. Here are a few things that stand out to me.


As defined in the chapter on Service Level Objectives:

  • SLI = Service Level Indicator. In other words, a specific metric to track. For example, the success rate of an endpoint. Have as few as possible, to simplify reasoning about service health.
  • SLO = Service Level Objective. A desired SLI value. For example, a 99% success rate.
  • SLA = Service Level Agreement. A contractual agreement between a customer and service provider defining compensation if SLOs are not met. Most free products do not need SLAs.
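A tiny sketch of how an SLI and SLO relate, using the success-rate example above. The request counts are made up, and treating a period with no traffic as healthy is my assumption rather than anything from the handbook.

```python
def success_rate(successes: int, total: int) -> float:
    """SLI: the fraction of requests that succeeded in some window."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return successes / total


def meets_slo(sli: float, slo: float) -> bool:
    """SLO: the SLI value we want to stay at or above."""
    return sli >= slo


# Hypothetical window: 9,950 of 10,000 requests succeeded.
sli = success_rate(successes=9_950, total=10_000)
ok_at_99 = meets_slo(sli, slo=0.99)     # within a 99% objective
ok_at_999 = meets_slo(sli, slo=0.999)   # misses a 99.9% objective
```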

The “Indicators in Practice” section of the SLO chapter provides some helpful guidelines about what to measure:

  • User-facing serving systems –> availability, latency, and throughput
  • Storage systems –> latency, availability, and durability
  • Big data systems –> throughput and end-to-end latency

In the context of less is more, note each domain has 2-3 SLIs.

Reasonable SLOs

Naively, I’d think a perfect SLO would be something like 100% availability, but the “Embracing Risk” chapter clarifies all changes have costs. Striving for 100% availability would constrain all development to the point where the business might fail for lack of responsiveness to customers’ feature requests, or because it spent all its money on monitoring.

Additionally, customers might not notice the difference between 99% and 100%. For example, “a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability!”

Related, a dependency’s SLOs can also provide a guideline. For example, if my service depends on a service with 99% availability, I can forget about 100% availability for my service.

I find downtime calculations helpful for reasoning about appropriate SLOs:

  • 90% = 36d/yr
  • 99% = 3d/yr
  • 99.9% = 8h/yr
  • 99.99% = 52m/yr
  • 99.999% = 5m/yr
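The arithmetic behind these figures is just the unavailable fraction multiplied by the length of a year. A short sketch, assuming a 365-day year; `allowed_downtime_minutes` is my own helper name:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service can be down and still meet the target."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for pct in (90, 99, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}%: {minutes / 1440:.2f} days = {minutes / 60:.2f} hours = {minutes:.1f} minutes")
```

The results agree with the rounded figures in the list: about 36.5 days, 3.65 days, 8.76 hours, 52.6 minutes and 5.3 minutes respectively.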

The Art of SLOs participant handbook has an “outage math” section that provides similar data, broken down by year, quarter and 28 days.

So, if our strategy is to page a person and have them mitigate any issue in an hour, we might consider a 99.99% SLO. A 5 minute mitigation requirement is outside human ability, so our strategy should include something like canary automation. In this context, a small project involving a couple people on a limited budget should probably consider a goal of 90% or 99% availability.

I found it helpful to walk through an example scenario. The Art of SLOs participant handbook provides several example “user journeys”. For example, I work on a free API developers consume in their apps. This fits the general description of “user-facing serving systems”, so availability, latency, and throughput are likely SLIs. Of these, most support requests concern availability and latency, so in the spirit of less is more, I’d focus on those.

I have an oncall rotation, pager automation (eg PagerDuty), and canary automation, but I’m also building on a service with 99% availability.

We can reasonably respond to pages in 30 minutes, and fail out of problematic regions within 30 minutes after that, but we also have occasional capacity issues which can take a few hours to resolve.

So, 99% seems like a reasonable availability SLO.
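As a sanity check on that conclusion, here is a sketch that subtracts a year of incidents from the error budget. The incident counts are invented for illustration (six one-hour regional incidents, matching the 30 + 30 minute response above, plus two three-hour capacity incidents), and it ignores downtime inherited from the 99% dependency.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; ignores leap years


def error_budget_hours(slo_percent: float) -> float:
    """Hours of downtime per year permitted by an availability SLO."""
    return HOURS_PER_YEAR * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, incident_hours: list[float]) -> float:
    """Budget left after the given incidents; negative means the SLO was missed."""
    return error_budget_hours(slo_percent) - sum(incident_hours)


# Hypothetical year: six 1-hour regional incidents and two 3-hour capacity incidents.
incidents = [1.0] * 6 + [3.0] * 2

remaining_at_99 = budget_remaining(99, incidents)     # roughly 75.6 hours to spare
remaining_at_999 = budget_remaining(99.9, incidents)  # negative: 12 hours exceeds 8.76
```

Under these made-up numbers a 99.9% objective would already be blown, while 99% leaves plenty of room.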

A latency SLI seems more straightforward to me, perhaps because it can be directly measured in a running system. One guideline that comes to mind is the perception of immediacy for events that take less than 100ms.

A nice data mart 🏪

The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team

I don’t have a lot of experience with data marts, but I recently met one that seems nice and simple.

The store benefits from a few other abstractions:

  1. a service that just ingests and persists client events
  2. a query abstraction, like Hive
  3. trustworthy authentication and list membership infra

Given these, the store in question simplifies the process of utilizing data by abstracting a few common requirements:

  1. a simple config DSL specifies which query to run, the frequency to run it, the output table, deletion conditions, etc. Specifying config via files enables use of common source control tools.
  2. three predefined processing stages (raw-to-normalized, normalized-to-problem-specific, problem-specific-to-view-specific). New event sources, aggregations and views can be independently defined by adding new config files.
  3. common styling and libraries for data visualization
  4. access is generalized to a few tiers of increasing restriction, eg team, division, company. The lowest level might be freely granted to teams for their own business intelligence, and the highest level restricted to executives for making revenue-specific decisions.
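I can’t share the actual DSL, but a sketch of the shape helps me remember it. Everything here is hypothetical: the field names, the table names, the elided queries and the `run_order` helper are mine, and the real system read config files rather than Python objects.

```python
from dataclasses import dataclass

# The three predefined processing stages, in execution order.
STAGES = (
    "raw-to-normalized",
    "normalized-to-problem-specific",
    "problem-specific-to-view-specific",
)
ACCESS_TIERS = ("team", "division", "company")


@dataclass(frozen=True)
class JobConfig:
    name: str
    stage: str
    query: str          # which query to run
    every_hours: int    # how often to run it
    output_table: str   # where results land
    retain_days: int    # deletion condition
    access_tier: str    # who may read the output

    def __post_init__(self):
        # Reject configs that don't fit the predefined stages or tiers.
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage}")
        if self.access_tier not in ACCESS_TIERS:
            raise ValueError(f"unknown access tier: {self.access_tier}")


def run_order(jobs: list[JobConfig]) -> list[JobConfig]:
    # Earlier stages feed later ones, so order jobs by stage.
    return sorted(jobs, key=lambda job: STAGES.index(job.stage))


jobs = [
    JobConfig("weekly_signups_chart", STAGES[2], "SELECT ...", 24, "views.weekly_signups", 90, "team"),
    JobConfig("normalize_events", STAGES[0], "SELECT ...", 1, "normalized.events", 30, "team"),
    JobConfig("signups_by_region", STAGES[1], "SELECT ...", 24, "growth.signups_by_region", 365, "division"),
]
```

Adding a new view is then a matter of adding one more config entry, independent of the others.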

In retrospect, this seems pretty straightforward. I’m remembering a tool from another team (basically Hadoop + Rails + D3) that had the same goals, but didn’t have the query, scheduling or ACL abstractions underneath. It was replaced by an external tool that was terrible to the point of being unusable, but more secure. Eventually, we dumped normalized data in a columnar store that was also secure and easier to use for our team’s business intelligence, but would’ve been insufficient for things like periodically updating charts. I guess it’s the combination of data store features and supporting infra that makes the magic happen.

Mobile growth: personalization

Personalization can take a number of forms:

  • configuration
  • notifications
  • content
  • sponsored

All these forms require “targeting” logic.



Configuration

A JSON blob on a CDN enables configuration, but it doesn’t take qualities of the caller into account. We can personalize configuration by running it through a targeting layer.

Both services and clients need configuration that is independent of releases; the distinction is the target. If the service is the target, then we probably want to use a low-level, service-oriented config infra, like Zookeeper. If the user of that service is the target, then there will likely be overlap with the configuration mentioned here.


Notifications

The pattern of targeting groups of users for notifications is well-established, so I’ll just reiterate targeting and campaign tooling can be reused for other forms of personalization.


Content

A few examples of personalized content:

  • recommendations, like “customers who bought this also bought …”
  • content tailored to the user, eg Twitter’s curated timeline
  • notifications inside an app

An important distinction is: manual vs automated content management. Highlighting an upcoming conference for folks in the area using an in-app notification would be an example of the former. Prioritizing recent articles from the user’s location would be an example of the latter.


Sponsored

I read somewhere that the ideal ad is content; ads are annoying insomuch as they’re not what we’re looking for.

I suppose a clear distinction between sponsored content and other forms of personalized content is that the former is paid for, but otherwise, the line seems blurry. Both are targeted, and can be statically or dynamically defined.


Targeting

Targeting inputs can be “online” and/or “offline”. An example of the former would be using data from the User-Agent header of an HTTP request to tailor the response. An example of the latter would be using aggregate analytics data to tailor the response. Both can be used together. For example, prioritizing online inputs if latency of offline collection is a problem, or prioritizing offline if aggregation produces a higher-quality input.
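A sketch of combining the two kinds of input, with a preference and a fallback. The User-Agent check is deliberately crude (not a real parser), and `OFFLINE_PROFILES`, `resolve_os` and the sample user are hypothetical stand-ins for an analytics pipeline’s output.

```python
from typing import Optional


def os_from_user_agent(user_agent: str) -> Optional[str]:
    """Online input: derived from the request itself, available immediately."""
    if "iPhone" in user_agent or "iPad" in user_agent:
        return "iOS"
    if "Android" in user_agent:
        return "Android"
    return None


# Offline input: a hypothetical aggregate produced by an analytics batch job.
OFFLINE_PROFILES = {"user-1": {"os": "Android", "sessions_per_week": 12}}


def resolve_os(user_id: str, user_agent: str, prefer: str = "online") -> Optional[str]:
    online = os_from_user_agent(user_agent)
    offline = OFFLINE_PROFILES.get(user_id, {}).get("os")
    # Use the preferred source, falling back to the other when it has no answer.
    first, second = (online, offline) if prefer == "online" else (offline, online)
    return first if first is not None else second


IPHONE_UA = "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X)"
```

Here `resolve_os("user-1", IPHONE_UA)` answers from the request, while passing `prefer="offline"` answers from the aggregate and only falls back to the header for users the batch job has not seen.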

An important point is trying to consolidate targeting logic. It’s unsurprising for, say, Notifications and Ads to be different orgs, but both orgs independently developing User-Agent parsing, IP-geo resolution, syntax for declaring conditions, etc is a waste. I find it helpful to think of targeting as a utility for many forms of personalization.

Providing a DSL for targeting enables customers to use standard source code management tools and practices. An example of targeting syntax:

condition is_ios = == 'iOS'
param message = is_ios ? 'Hi, iOS' : 'Hi, Android' 

Note this example is almost JavaScript. I’d be curious to experiment with using a JS runtime for targeting definition and evaluation.
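Short of embedding a JS runtime, the evaluation side can be sketched in a few lines: conditions are named predicates over a request context, and params pick the first value whose condition matches. This is my own toy model, not the DSL’s actual implementation; in particular, the `os` context attribute is a hypothetical stand-in for whatever the condition above compares against.

```python
from typing import Any, Callable

Context = dict[str, Any]

# Conditions: named predicates over a request context.
# "os" is a hypothetical context attribute chosen for this sketch.
CONDITIONS: dict[str, Callable[[Context], bool]] = {
    "is_ios": lambda ctx: ctx.get("os") == "iOS",
}

# Params: ordered (condition name, value) rules plus a default value.
PARAMS: dict[str, dict[str, Any]] = {
    "message": {
        "rules": [("is_ios", "Hi, iOS")],
        "default": "Hi, Android",
    },
}


def evaluate(param: str, ctx: Context) -> Any:
    """Return the value of the first rule whose condition holds, else the default."""
    spec = PARAMS[param]
    for condition_name, value in spec["rules"]:
        if CONDITIONS[condition_name](ctx):
            return value
    return spec["default"]
```

Because conditions are defined once and referenced by name, the same `is_ios` could back notifications, content and configuration, which is the consolidation argument above.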