Software engineering notes

Site reliability


Google’s SRE handbook, summarized in The Calculus of Service Availability, and the accompanying Art of SLOs workshop materials, are great. Here are a few things that stand out to me.


As defined in the chapter on Service Level Objectives:

  • SLI = Service Level Indicator. In other words, a specific metric to track. For example, the success rate of an endpoint. Have as few as possible, to simplify reasoning about service health.
  • SLO = Service Level Objective. A desired SLI value. For example, a 99% success rate.
  • SLA = Service Level Agreement. A contractual agreement between a customer and service provider defining compensation if SLOs are not met. Most free products do not need SLAs.

The “Indicators in Practice” section of the SLO chapter provides some helpful guidelines about what to measure:

  • User-facing serving systems –> availability, latency, and throughput
  • Storage systems –> latency, availability, and durability
  • Big data systems –> throughput and end-to-end latency

In the context of less is more, note each domain has 2-3 SLIs.

Reasonable SLOs

Naively, I’d think a perfect SLO would be something like 100% availability, but the “Embracing Risk” chapter clarifies all changes have costs. Striving for 100% availability would constrain all development to the point where the business might fail for lack of responsiveness to customers’ feature requests, or because it spent all its money on monitoring.

Additionally, customers might not notice the difference between 99% and 100%. For example, “a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability!”

Relatedly, a dependency’s SLOs can also provide a guideline. For example, if my service depends on a service with 99% availability, I can forget about 100% availability for my service.
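To make the dependency point concrete, here is a minimal sketch that multiplies availabilities together. It assumes every dependency is a hard, serial dependency and that failures are independent, which is a simplification; the function name and the figures are made up for illustration.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Best-case availability when every component must be up.

    Assumes hard (serial) dependencies with independent failures.
    """
    return prod(availabilities)


# Hypothetical figures: my service alone at 99.9%, its dependency at 99%.
combined = serial_availability(0.999, 0.99)
print(f"{combined:.4%}")
```

Under these assumptions the combined figure lands below the weaker of the two numbers, which is why the dependency’s SLO acts as a ceiling.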

I find downtime calculations helpful for reasoning about appropriate SLOs:

  • 90% = 36d/yr
  • 99% = 3d/yr
  • 99.9% = 8h/yr
  • 99.99% = 52m/yr
  • 99.999% = 5m/yr

The Art of SLOs participant handbook has an “outage math” section that provides similar data, broken down by year, quarter and 28 days.
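The numbers above are just the window length multiplied by the allowed failure fraction. Here is a minimal sketch of that arithmetic; the function name is mine, and treating a quarter as 91 days is an assumption for the example.

```python
from datetime import timedelta

# Window lengths; the 91-day quarter is an approximation.
WINDOWS = {
    "year": timedelta(days=365),
    "quarter": timedelta(days=91),
    "28 days": timedelta(days=28),
}


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Downtime budget for an availability SLO over a window."""
    return window * (1 - slo_percent / 100)


for slo in (90, 99, 99.9, 99.99, 99.999):
    budgets = ", ".join(
        f"{name}: {allowed_downtime(slo, window)}"
        for name, window in WINDOWS.items()
    )
    print(f"{slo}% -> {budgets}")
```

The output is rounded to microseconds, so the values differ slightly from the rounded figures in the list.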

So, if our strategy is to page a person and have them mitigate any issue in under an hour, we might consider a 99.99% SLO, which leaves room for about one such incident per year. A 99.999% SLO implies a 5 minute mitigation requirement, which is outside human ability, so at that level our strategy should include something like canary automation. In this context, a small project involving a couple people on a limited budget should probably consider a goal of 90% or 99% availability.

I found it helpful to walk through an example scenario. The Art of SLOs participant handbook provides several example “user journeys”. For example, I work on a free API developers consume in their apps. This fits the general description of a “user-facing serving system”, so availability, latency, and throughput are likely SLIs. Of these, most support requests concern availability and latency, so in the spirit of less is more, I’d focus on those.

I have an oncall rotation, pager automation (eg PagerDuty), and canary automation, but I’m also building on a service with 99% availability.

We can reasonably respond to pages in 30 minutes, and fail out of problematic regions within 30 minutes after that, but we also have occasional capacity issues which can take a few hours to resolve.

So, 99% seems like a reasonable availability SLO.

A latency SLI seems more straightforward to me, perhaps because it can be directly measured in a running system. One guideline that comes to mind is the perception of immediacy for events that take less than 100ms.

Written by Erik

December 21, 2019 at 10:35 am

Posted in book, SRE

A nice data mart 🏪


The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team.

I don’t have a lot of experience with data marts, but I recently met one that seems nice and simple.

The store benefits from a few other abstractions:

  1. a service that just ingests and persists client events
  2. a query abstraction, like Hive
  3. trustworthy authentication and list membership infra

Given these, the store in question simplifies the process of utilizing data by abstracting a few common requirements:

  1. a simple config DSL specifies which query to run, the frequency to run it, the output table, deletion conditions, etc. Specifying config via files enables use of common source control tools.
  2. three predefined processing stages (raw-to-normalized, normalized-to-problem-specific, problem-specific-to-view-specific). New event sources, aggregations and views can be independently defined by adding new config files.
  3. common styling and libraries for data visualization
  4. access is generalized to a few tiers of increasing restriction, eg team, division, company. The lowest level might be freely granted to teams for their own business intelligence, and the highest level restricted to executives for making revenue-specific decisions.
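I can’t share the actual DSL, but the shape of one config entry might look something like the sketch below. Every name here (the classes, the fields, the table names, the query) is hypothetical and only meant to illustrate the four points above.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Stage(Enum):
    RAW_TO_NORMALIZED = "raw-to-normalized"
    NORMALIZED_TO_PROBLEM = "normalized-to-problem-specific"
    PROBLEM_TO_VIEW = "problem-specific-to-view-specific"


class AccessTier(Enum):
    # Ordered as in the text: team, division, company.
    TEAM = 1
    DIVISION = 2
    COMPANY = 3


@dataclass(frozen=True)
class JobConfig:
    stage: Stage
    query: str               # query to run, eg via Hive
    every: timedelta         # how often to run it
    output_table: str        # where results land
    delete_after: timedelta  # deletion condition for old rows
    access: AccessTier

    def validate(self) -> None:
        """Cheap checks that could run in code review or CI."""
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if self.every <= timedelta(0):
            raise ValueError("frequency must be positive")
        if not self.output_table:
            raise ValueError("output table is required")


# Hypothetical job: aggregate normalized events into daily signups.
daily_signups = JobConfig(
    stage=Stage.NORMALIZED_TO_PROBLEM,
    query="SELECT day, COUNT(*) FROM normalized_events GROUP BY day",
    every=timedelta(days=1),
    output_table="team_growth.daily_signups",
    delete_after=timedelta(days=90),
    access=AccessTier.TEAM,
)
daily_signups.validate()
```

Because each job is one small declaration in a file, adding a new aggregation or view is a source-controlled change like any other.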

In retrospect, this seems pretty straightforward. I’m remembering a tool from another team (basically Hadoop + Rails + D3) that had the same goals, but didn’t have the query, scheduling or ACL abstractions underneath. It was replaced by an external tool that was terrible to the point of being unusable, but more secure. Eventually, we dumped normalized data in a columnar store that was also secure and easier to use for our team’s business intelligence, but would’ve been insufficient for things like periodically updating charts. I guess it’s the combination of data store features and supporting infra that makes the magic happen.

Written by Erik

October 14, 2019 at 8:52 pm

Posted in pattern, tool

Dave Chang interviews Preet Bharara


Dave Chang, chef and restaurateur, interviewed Preet Bharara, former US attorney for the Southern District of New York, on June 20th, 2019. Their discussion focused largely on professional development and management and identified commonalities in their respective professions. I too identified commonalities with my experience in software development, so I wanted to take some notes for future reference. I tried to clearly quote and paraphrase, but this isn’t intended to be a transcript. I summarize, merge, reorder and comment on the thoughts.

Bharara’s book Doing Justice was the inspiration for the interview. Chang has recommended it to many of his managers. He found the process by which Bharara approached criminal justice was the same as for being a chef. Bharara clarifies the book was intended for anyone, at any level of an organization, and is about how to approach problems in general.

The book has a chapter on asking questions. Junior and senior members of an organization need to feel safe asking questions. Chang comments that not asking questions is how we start making mistakes.

Innovation is important, but isn’t always dramatic. Innovation can take the form of thinking about a problem differently, and come from anyone at any level of an organization. It’s important to cultivate a “culture of innovation”. It’s insufficient to have a couple innovative people. Management must respect innovation.

This ties back to asking questions; it’s important to understand and question the rationale behind the status quo. Doing things one way because that’s the way they’ve always been done is insufficient. For example, until recently no one thought to use wire-tapping for insider trading investigations, even though insider trading is an act of communication.

Bharara describes the slow pace of change in government and how we have to work around that. His was the first US attorney’s office to contract with a data analysis firm, and the first to have a Twitter account. He describes people looking at him like he had three heads, which I find validating; I’ve felt insecure before expressing an unorthodox view.

At one point, Bharara recognized he understood law well enough, but had no training in business management. Two things are important for institutions: continuity and change. We want continuity of values and a culture of innovation.

People don’t like change. The legal and culinary professions are especially averse to change. Chang describes chefs dismissing advanced cooking techniques: “I don’t need to learn anything. I have fire. I have chicken.”

Chang wrote down a line from the book: “There are people who fight for the status quo and reject change.” (Aside, this fits the Economist’s definition of “conservatism”, so there’s a positive side of this too.)

How do you change an organization to be more open to innovation? It may be sufficient to identify and support a minority who embrace the concept. Bharara comments: “Not everyone is going to be a leader. You want to have more leaders than the average organization, but a lot of people are going to be followers.” Chang asks: “Can you be effective at your job as a follower?” Bharara clarifies a distinction between innovation and execution. Both are valuable. Some people are better at one or the other. Few are good at both.

Different people have different skills. As we ascend higher in an organization, people become more specialized. Any one person may be better than average in any role, but the difference between the best team and the worst team is how closely people’s skills are aligned with their roles.

Chang asks Bharara if he was successful at identifying people’s skills in his first management job as US attorney. He responds: the most important part of being a leader, because one person can’t do everything, is identifying who the best people are and putting them where they belong. He consulted a lot. He also saw value in a balance of, say, aggressive and cautious people on a team.

Promotion isn’t always a good idea. Different levels require different skills. Bharara’s unaware of any manager who doesn’t have a record of significant personnel mistakes. Chang comments this is exactly the case in the culinary industry too. Traditionally, people are promoted in a kitchen based on excellence at one level, but that is not an indicator of excellence at the next level. Bharara agrees and describes law offices as typically having poor management because there are no business people involved. Management and individual contribution are completely different skills and have completely different reward structures. For example, no US attorney has tried a case in recent history because they’re busy managing. Bharara codified this by prohibiting leads from participating directly in cases, describing that as an “indulgence” for a manager.

Self-awareness is important in this context. It’s beneficial for people to recognize whether they want to manage other people. Bharara would ask people seeking promotion “Are you sure you want that?” He continues: “Some people are so ambitious they think that there’s a natural progression to their career that must include certain kinds of promotion … I wish more people thought about their own fun and ability rather than always being on this rat race to have another item to put on their resume.”

Chang acknowledges the same is true in the kitchen. As an executive chef, he no longer cooks. Bharara talks about his difficulty describing his job, but it boils down to “meetings”. Leaders oversee and are the outward face of an organization. Bharara paraphrases Pat Fitzgerald, a former US attorney: when you have the right people, who know what they’re doing, the job of the leader is to get out of the way and let them do it, and when they’re not doing it, to steer them.

Likewise, folks in leadership should be aware of their skills, roles and the fact they are often not the best choice for direct execution. “If you want something done well, you have to do it yourself” is an anti-pattern in this context. Chang describes chefs taking direct control in times of stress, but because they’re not involved in the day-to-day production of the kitchen, this is often disastrous: “You’re going to ruin the flow of the kitchen through entirely one’s own ego.”

Bharara describes two motivations for this:

  1. leadership thinks “I could do this better”
  2. leadership is trying to demonstrate they add value

Chang describes a risk at the upper limit of ability: the person eventually creates something only they can maintain, or they run the organization in a way that only works for them. Both are bad for the organization. Part of Chang’s job is to shake them out of it.

Chang asks how people adapt to loss of control and the recognition that they’re not the best at everything. Humility is helpful. People presume the head of the office is the best at everything. Bharara describes having “warring self-doubt”, which was validating for me to hear. He also describes being very nervous about starting a new stage in his career when he created his podcast. Consulting with experts is essential. The difference between good and great is consultation.

Talent is often the greatest obstacle to becoming a great chef. At some point, talent is no longer the most important factor to success. Chang describes telling talented people they need to grow up. I wonder if he’s talking about a threshold between individual contribution and management, and if this is in conflict with the earlier discussion about self-awareness.

The interview closes with discussion of making decisions under stress. Chang jokes that every day in a restaurant is like defusing a bomb. Bharara describes the tension between imminent danger and sufficient evidence. I suppose the general theme is problem solving under pressure.

Another general theme is how to prepare for the unknown. Judgement is as important as education and credentials. Bharara comments there are lots of intelligent people he wouldn’t put in charge of anything; some folks are much more comfortable with contemplation over a long period of time. Bharara describes the importance of core values in these moments. For SDNY, the mission is: “Do the right thing in the right way for the right reasons every day and that’s all.” He paraphrases To Kill A Mockingbird: trying to do the right thing, is the right thing.

Good behavior is important for effective management. Fear, intimidation and perfectionism don’t work in the real world. Bharara paraphrases Eisenhower: hitting people over the head is assault, not leadership. Empathy, respect and even temperament are much better. Chang acknowledges that’s a lot to ask of someone who just wanted to be a cook.

Bharara states one of his goals is to never let the people he leads see him freak out. He freaks out with his closest deputies, but the organization as a whole benefits from calm leadership.

Written by Erik

October 13, 2019 at 11:48 am

Posted in perspectives

Deep Work by Cal Newport 📖


I’ve found I need uninterrupted time to focus for software projects at work, and actually leave work unsatisfied if I’m unable to find this time. While investigating what this is all about, I came across Newport’s Deep Work.

The first half of the book makes a case for investing in focus time, especially for folks performing “knowledge work”, which can require workers to hold many details in mind to produce significant output. The second half of the book provides practical recommendations for how to set aside uninterrupted time.

Focus time

Like Paul Graham’s Maker’s Schedule, Manager’s Schedule, the first recommendation is to explicitly identify and reserve time for focused work, which I found validating.

In practice, I find a couple common sources of distractions:

  • random requests and discussions via email, chat, etc
  • random requests and discussions around my desk area

The book recommends avoiding interruptive electronic communications, especially social media, when trying to focus, and makes a case that most interruptions can wait.

Regarding in-person interruptions, the author mentions sequestering himself in a library. I might try something similar.

One challenge I see: focus requirements might be dependent on when as well as what. For example, the early days of a project might be meeting-heavy as requirements are clarified, and then focus-heavy as pieces are built.

My current experiment is to reserve 8-10 and 2-5 for focus work. During these times, I’ll sit away from my desk and ignore chat and email. This leaves a couple hours before and after lunch to catch up on electronic communication, have spontaneous discussions and provide a reasonable window for scheduling meetings.

The book cites research indicating our ability to work deeply maxes out around four hours, which also aligns with my experience. I previously experimented with reserving 1-5, but found it was too restrictive for natural interactions with colleagues, and I didn’t need the entire time anyway.

In an effort to provide “ownership” for my project, I took pains to be aware of all changes and support requests. This obviously doesn’t scale, but I appreciated reading a supportive quote from Tim Ferriss:

Develop the habit of letting small bad things happen. If you don’t, you’ll never find time for the life-changing things.


An interesting aspect of this book is it’s not simply presenting thoughts on the topic of deep work; it was motivated by the author’s need to optimize productivity. In this context, the book introduced me to 4DX, which is like agile boiled down to four points. As with agile, I find myself referencing it often.

In particular, I appreciate the emphasis on prioritization; during focused effort, we say “no” to many things so we can say “yes” to the most important things.

The book also mentions the idea of “results-driven reporting”, which defers meetings until there’s something new to report, as an alternative to status updates. This aligns with recommendations to prioritize requirements and blockers over status updates in cadence meetings from a recent talk on politically-charged projects.

Written by Erik

October 6, 2019 at 6:41 pm

Posted in book

Politically-charged projects


I attended a talk yesterday that shared best practices from managing two large projects that suffered from competing priorities:

  • prioritize, prioritize, prioritize <– reminds me of 4DX
  • put goals, tasks, vocabulary, agreements, engineering guidance, etc in writing to clarify communication
  • use pilot programs to clarify requirements
  • hold regular cadence meetings <– reminds me of 4DX
  • focus cadence meetings on clarifying requirements and getting help more than status updates
  • have periodic summits to build team cohesion
  • invite folks from different teams and with different roles to the summits to get a diversity of perspectives
  • identify and support “influencers” in the teams you need help from

One of the problems involved signing a contract before performing any engineering feasibility analysis and then having to turn the org on a dime to meet the deadline. 🦶🔫 I have experience being on the other side of abrupt changes and wondering what was going on, so it was validating to hear the context.

Written by Erik

October 3, 2019 at 8:02 pm

Posted in org


Mobile growth: personalization


Personalization can take a number of forms:

  • configuration
  • notifications
  • content
  • sponsored

All these forms require “targeting” logic.



Configuration

A JSON blob on a CDN enables configuration, but it doesn’t take qualities of the caller into account. We can personalize configuration by running it through a targeting layer.

Services and clients require configuration independent from release, but a distinction is the target. If the service is the target, then we probably want to use a low-level, service-oriented config infra, like Zookeeper. If the user of that service is the target, then there will likely be overlap with the configuration mentioned here.


Notifications

The pattern of targeting groups of users for notifications is well-established, so I’ll just reiterate that targeting and campaign tooling can be reused for other forms of personalization.


Content

A few examples of personalized content:

  • recommendations, like “customers who bought this also bought …”
  • content tailored to the user, eg Twitter’s curated timeline
  • notifications inside an app

An important distinction is: manual vs automated content management. Highlighting an upcoming conference for folks in the area using an in-app notification would be an example of the former. Prioritizing recent articles from the user’s location would be an example of the latter.


Sponsored

I read somewhere that the ideal ad is content; ads are annoying insomuch as they’re not what we’re looking for.

I suppose a clear distinction between sponsored content and other forms of personalized content is that the former is paid for, but otherwise, the line seems blurry. Both are targeted, and can be statically or dynamically defined.


Targeting

Targeting inputs can be “online” and/or “offline”. An example of the former would be using data from the User-Agent header of an HTTP request to tailor the response. An example of the latter would be using aggregate analytics data to tailor the response. Both can be used together. For example, prioritizing online inputs if latency of offline collection is a problem, or prioritizing offline if aggregation produces a higher-quality input.

An important point is trying to consolidate targeting logic. It’s unsurprising for, say, Notifications and Ads to be different orgs, but both orgs independently developing User-Agent parsing, IP-geo resolution, syntax for declaring conditions, etc is a waste. I find it helpful to think of targeting as a utility for many forms of personalization.

Providing a DSL for targeting enables customers to use standard source code management tools and practices. An example of targeting syntax:

condition is_ios = == 'iOS'
param message = is_ios ? 'Hi, iOS' : 'Hi, Android' 

Note this example is almost JavaScript. I’d be curious to experiment with using a JS runtime for targeting definition and evaluation.
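To show what evaluating such a definition could involve, here is a minimal sketch in Python rather than a JS runtime. It is not the DSL above: the `Context` fields, the naive User-Agent check, and the `Param` class are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Context:
    """Targeting inputs; field names are hypothetical."""
    user_agent: str = ""        # "online" input, from the request
    signups_last_week: int = 0  # "offline" input, from aggregated analytics


Condition = Callable[[Context], bool]


def is_ios(ctx: Context) -> bool:
    # Naive substring check for illustration; real User-Agent parsing
    # is more involved and worth sharing across orgs.
    return "iPhone" in ctx.user_agent or "iPad" in ctx.user_agent


@dataclass(frozen=True)
class Param:
    """A value chosen by the first matching condition, else a default."""
    rules: Tuple[Tuple[Condition, str], ...]
    default: str

    def resolve(self, ctx: Context) -> str:
        for condition, value in self.rules:
            if condition(ctx):
                return value
        return self.default


message = Param(rules=((is_ios, "Hi, iOS"),), default="Hi, Android")

ios_request = Context(
    user_agent="Mozilla/5.0 (iPhone; CPU iPhone OS 13_1 like Mac OS X)"
)
print(message.resolve(ios_request))
print(message.resolve(Context(user_agent="Mozilla/5.0 (Linux; Android 10)")))
```

Keeping conditions as small named functions is one way to let Notifications, Ads and others share the same parsing logic.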

Written by Erik

October 3, 2019 at 7:10 pm

Posted in mobile-growth

Saying yes to important things


I attended a talk today on the art of saying no and a couple things stood out:

  • saying yes to important things makes it easier to say no to less important things, ie prioritize
  • if the impact of requested work is unclear, request clarification before agreeing to do the work

This reminded me of something a colleague once mentioned: part of the art of saying no effectively is making everyone aware of the costs

Written by Erik

October 3, 2019 at 7:00 pm

Posted in org


Mobile growth: experimentation


Once we have a way to quantify usage, we can compare usage between variants of an app.

To do this, we need to log an event when someone enters an experiment variant. I’ve heard this referred to as an “impression” or “activation” event. Firebase ABT simplifies things a bit by enabling developers to identify an existing event for this purpose. The basic idea is to serialize events by time, identify a common start point (the activation event), and then compare the events that follow for things like increased signups, increased time using the app, increased purchases, etc.
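The basic idea can be sketched in a few lines. This is a toy illustration, not how Firebase ABT or any other provider works; the event names, the `Event` shape and the sample data are all made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Event(NamedTuple):
    user: str
    name: str
    timestamp: float
    variant: str = ""  # only set on activation events


def conversion_by_variant(
    events: Iterable[Event],
    activation: str = "experiment_activated",
    goal: str = "signup",
) -> Dict[str, float]:
    """Share of activated users per variant who hit the goal afterwards."""
    variant_of: Dict[str, str] = {}
    converted = set()
    # Serialize by time so only goals after activation are counted.
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.name == activation:
            variant_of.setdefault(event.user, event.variant)
        elif event.name == goal and event.user in variant_of:
            converted.add(event.user)

    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for user, variant in variant_of.items():
        totals[variant] += 1
        hits[variant] += user in converted
    return {variant: hits[variant] / totals[variant] for variant in totals}


events = [
    Event("u1", "experiment_activated", 1.0, "A"),
    Event("u1", "signup", 2.0),
    Event("u2", "experiment_activated", 1.0, "B"),
    Event("u3", "signup", 0.5),  # before activation, so not counted
    Event("u3", "experiment_activated", 1.5, "A"),
]
rates = conversion_by_variant(events)
print(rates)
```

A real analysis would also need significance testing, which is part of why using an existing provider is attractive.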

It’s critical this event is logged equivalently for all variants so we can compare apples to apples. This is an example of where QA features in analytics SDKs and services are helpful.

Testing identical variants (“A/A testing”) is helpful for identifying issues in analysis infrastructure.

As with analytics, building experimentation infrastructure is non-trivial and the cost of errors is high, so using an existing provider is advisable.

Written by Erik

September 28, 2019 at 7:22 pm

Posted in mobile-growth

Mobile growth: analytics


Local features of an installation, like locale or device type, provide a limited opportunity for personalization. Defining a mechanism for communicating feedback from an app to the service supporting the app expands the range of opportunity.

Analytics infra generally provides a few things:

  • a process for defining events
  • an SDK for logging events and communicating them to a service
  • service infra to persist a high volume of events
  • storage for a large volume of events
  • stream, batch and/or ad hoc aggregation
  • visualization of aggregate data

Given all this is non-trivial, and the cost of errors is high, using one of the many existing analytics providers is advisable.


Logging garbage is costly. A simple example would be defining events as simple strings, misspelling an event string, failing to include the misspelling in aggregation logic resulting in an erroneous report, and basing a business decision on the report. The latency involved in collecting, aggregating and analyzing event data can make such errors hard to detect.

A process and tooling for explicitly defining event types can reduce the risk of logging garbage. For example, we can use protobuf to define events and source control to oversee protobuf maintenance, and then use the protobuf consistently at all layers, from event generation to aggregation.
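As a lighter-weight illustration of the same idea, here is a sketch that uses a Python enum instead of protobuf. The event names are hypothetical; the point is only that a misspelled event fails loudly where it is created instead of silently skewing a report.

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class EventType(Enum):
    # The single, source-controlled list of known events (names are made up).
    SIGNUP = "signup"
    PURCHASE = "purchase"


@dataclass(frozen=True)
class Event:
    type: EventType
    timestamp: float = field(default_factory=time.time)


def parse_event_type(raw: str) -> EventType:
    """Reject unknown names at the boundary instead of logging garbage."""
    return EventType(raw)  # raises ValueError for a name like "sigunp"


event = Event(parse_event_type("signup"))
print(event.type)
```

Aggregation code can then switch on `EventType` members, so the producer and the consumer share one definition.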


A simple SDK can just have a method to log events, a buffer of events, and network logic to flush the buffer periodically to the analytics service.

One nuance concerns the priority of events. For example, we might want to report errors immediately, or monitor events more closely during a release.

Because the events logged by the SDK are critical for growth functionality, providing a way to mock the SDK in tests is helpful for QA.
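Here is a minimal sketch of such an SDK, assuming a single-threaded client. The class and parameter names are made up, and to keep it short the buffer is flushed when `log` is called after the interval has elapsed rather than by a background timer. The transport is injected so tests can swap in a mock.

```python
import time
from typing import Callable, Dict, List

Transport = Callable[[List[Dict]], None]


class AnalyticsClient:
    def __init__(
        self,
        send: Transport,
        max_buffer: int = 20,
        flush_interval: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_buffer = max_buffer
        self._flush_interval = flush_interval
        self._clock = clock
        self._buffer: List[Dict] = []
        self._last_flush = clock()

    def log(self, name: str, urgent: bool = False, **params) -> None:
        """Buffer an event; flush if it is urgent, the buffer is full,
        or the flush interval has elapsed."""
        self._buffer.append({"name": name, "params": params})
        due = self._clock() - self._last_flush >= self._flush_interval
        if urgent or due or len(self._buffer) >= self._max_buffer:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)
        self._last_flush = self._clock()


# In a test, the "network" is just a list that records batches.
sent: List[List[Dict]] = []
client = AnalyticsClient(send=sent.append)
client.log("screen_view", screen="home")  # buffered
client.log("error", urgent=True, code=500)  # flushes both events
```

The `urgent` flag is one way to express event priority; a production SDK would also need retries, persistence and threading, which I’m ignoring here.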

I’m sure there are a million other nuances folks on analytics teams can speak to, but from the perspective of an SDK user, I just need a way to log events (and assert they were logged correctly).


My only experience with analytics services concerns asserting events were logged correctly.

Enabling developers to point an SDK at a mock endpoint and listen to the event stream is helpful for development. Enabling test infra to access the resulting logs enables integration testing.


Providing intermediate columnar storage, like Dremel or Vertica, is helpful for ad hoc analysis.

Providing access control at the storage layer ensures data is only visible to those who need it.


We typically need to aggregate analytics data for it to be useful. For example, signups per day vs a single signup event. To this end, tools supporting aggregation, like Flume, are helpful.
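At toy scale, the signups-per-day example looks like the sketch below. The `(name, timestamp)` event shape is an assumption, and days are bucketed in UTC; a tool like Flume does the same kind of grouping over far more data.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple


def signups_per_day(events: Iterable[Tuple[str, float]]) -> Dict[str, int]:
    """Count signup events per UTC day from (name, unix_timestamp) pairs."""
    counts: Counter = Counter()
    for name, timestamp in events:
        if name == "signup":
            day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
            counts[day.isoformat()] += 1
    return dict(counts)


sample = [("signup", 0), ("purchase", 10), ("signup", 3600), ("signup", 86400)]
print(signups_per_day(sample))
```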


Analytics data is often presented as a time-series. Storage and client-side tools for displaying time-series data are helpful.

Written by Erik

September 28, 2019 at 7:21 pm

Posted in mobile-growth

Mobile growth: authentication


I define “authentication” broadly to cover assertion of app and user (including anonymous) identity.

The principle of least privilege can help us determine what type of authentication a given feature requires.

In general, I bias toward standards, namely OAuth 2, to avoid reinventing the wheel (and fixing the same bugs), especially with respect to security, where bugs can be very expensive.


A caller’s IP address is usually the baseline server-side identifier. We can use an IP address to derive a reasonable default location, for example.


Asserting the identity of an app is a hard problem. Malicious users can easily scrape identifiers out of an app instance, but we need to start somewhere.

Google’s “API key restrictions” are the closest I’ve seen to app authentication.


Now that we have an idea of which app is calling, we can identify the caller further by defining an “instance”. A simple approach is to just generate a random number or uuid, persist it in the client, and tolerate some collisions.

A slightly more complicated approach is to also generate and persist a secret, and register it with the service supporting the app, on installation, and then use a token derived from that secret ever after to identify the app. I like this approach because it is still relatively cheap and makes an incremental step toward authenticating the caller.

Anything stored server-side and associated with an instance should require an instance token.
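One way to derive such a token is an HMAC over the instance ID and an issue time. This is a sketch under my own assumptions (symmetric secret stored server-side, a made-up token format, a one hour lifetime), not a description of any particular service; for anything real I’d still lean on a standard.

```python
import hashlib
import hmac
import secrets
import uuid
from typing import Dict, Tuple


def new_instance() -> Tuple[str, bytes]:
    """Client side, on installation: an ID and a secret to persist."""
    return str(uuid.uuid4()), secrets.token_bytes(32)


def _sign(secret: bytes, instance_id: str, issued_at: str) -> str:
    message = f"{instance_id}.{issued_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def make_token(instance_id: str, secret: bytes, issued_at: int) -> str:
    """Token format (hypothetical): <instance id>.<issued at>.<signature>."""
    return f"{instance_id}.{issued_at}.{_sign(secret, instance_id, str(issued_at))}"


def verify_token(
    token: str, registered: Dict[str, bytes], now: int, max_age: int = 3600
) -> bool:
    """Server side: check the signature against the registered secret."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    instance_id, issued_at, signature = parts
    secret = registered.get(instance_id)
    if secret is None or not issued_at.isdigit():
        return False
    expected = _sign(secret, instance_id, issued_at)
    fresh = 0 <= now - int(issued_at) <= max_age
    return fresh and hmac.compare_digest(signature.encode(), expected.encode())


instance_id, secret = new_instance()
registered = {instance_id: secret}  # what the service stores at registration
token = make_token(instance_id, secret, issued_at=1000)
print(verify_token(token, registered, now=1010))
```

The server can then require a valid token before reading or writing anything associated with that instance.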


The next layer of authentication is the person using the app instance.

Many apps do not need a person to authenticate, but would benefit from growth features. A weather app that wants to A/B test new features would be an example.

Another subset of apps provide some functionality before a person authenticates and would like to ensure a continuous experience before and after a person authenticates. An example would be a comment widget that enables composition while logged out, but requires authentication before publication.

Anonymous state is generally device-specific as it’s much easier to transfer state between devices with a common user identifier.


Identifying a user can be as simple as asking for a username and password. Basing user authentication on email or phone can reduce the friction of inventing usernames and passwords, and provides a communication channel for things like account recovery. Federated authentication improves security through consolidation of account management, and can further reduce friction, so long as the user wants account consolidation.

We can pass an instance token in a user authentication request to provide a personalized experience incorporating what we know about the installation, for example.

Written by Erik

September 28, 2019 at 7:20 pm

Posted in mobile-growth