Software engineering notes

Site reliability

Google’s SRE handbook, summarized in The Calculus of Service Availability, and the accompanying Art of SLOs workshop materials, are great. Here are a few things that stand out to me.

SLI vs SLO vs SLA

As defined in the chapter on Service Level Objectives:

  • SLI = Service Level Indicator. In other words, a specific metric to track. For example, the success rate of an endpoint. Have as few as possible, to simplify reasoning about service health.
  • SLO = Service Level Objective. A desired SLI value. For example, a 99% success rate.
  • SLA = Service Level Agreement. A contractual agreement between a customer and service provider defining compensation if SLOs are not met. Most free products do not need SLAs.
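To make the distinction concrete, here is a minimal sketch of a success-rate SLI checked against a 99% SLO. The function names and request counts are made up for illustration, and treating an empty window as meeting the objective is my own assumption.

```python
def success_rate(successes: int, total: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    if total == 0:
        # Assumption: a window with no traffic counts as fully successful.
        return 1.0
    return successes / total


def meets_slo(sli: float, slo: float) -> bool:
    """SLO check: is the measured SLI at or above the objective?"""
    return sli >= slo


# Hypothetical counts for one measurement window.
sli = success_rate(successes=9_950, total=10_000)
print(f"SLI = {sli:.2%}, meets 99% SLO: {meets_slo(sli, 0.99)}")
```

An SLA would sit on top of this as a contract describing what happens when the check fails, which is why it has no place in the code.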

The “Indicators in Practice” section of the SLO chapter provides some helpful guidelines about what to measure:

  • User-facing serving systems -> availability, latency, and throughput
  • Storage systems -> latency, availability, and durability
  • Big data systems -> throughput and end-to-end latency

In the context of less is more, note each domain has 2-3 SLIs.

Reasonable SLOs

Naively, I’d think a perfect SLO would be something like 100% availability, but the “Embracing Risk” chapter clarifies that all changes have costs. Striving for 100% availability would constrain all development to the point where the business might fail for lack of responsiveness to customers’ feature requests, or because it spent all its money on monitoring.

Additionally, customers might not notice the difference between very high levels of availability. For example, “a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability!”

Relatedly, a dependency’s SLOs can also provide a guideline. For example, if my service depends on a service with 99% availability, I can forget about 100% availability for my service.
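A rough way to see the dependency ceiling is to multiply availabilities. This is a sketch under two assumptions of mine: every dependency is a hard one (my service is down whenever it is down), and failures are independent. The function name is hypothetical.

```python
from math import prod


def composite_availability(own: float, dependencies: list[float]) -> float:
    """Estimate availability of a service with hard, independent dependencies.

    Assumption: the service is only up when it and every dependency are up,
    so the individual availabilities multiply.
    """
    return own * prod(dependencies)


# Even a flawless service cannot exceed its 99% dependency.
print(composite_availability(1.0, [0.99]))
# A 99.9% service on top of a 99% dependency lands a little below 99%.
print(composite_availability(0.999, [0.99]))
```

Caching, retries, or graceful degradation can soften a dependency, in which case this estimate is too pessimistic.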

I find downtime calculations (eg uptime.is) helpful for reasoning about appropriate SLOs:

  • 90% = 36d/yr
  • 99% = 3d/yr
  • 99.9% = 8h/yr
  • 99.99% = 52m/yr
  • 99.999% = 5m/yr

The Art of SLOs participant handbook has an “outage math” section that provides similar data, broken down by year, quarter and 28 days.
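The arithmetic behind these numbers is simple enough to sketch: allowed downtime is the window length times one minus the availability. The window lengths below (365 days for a year, 91 for a quarter) are my own assumptions, so results may differ slightly from the handbook's tables.

```python
from datetime import timedelta

# Windows mirroring the "year, quarter and 28 days" breakdown.
# Assumption: a year is 365 days and a quarter is 91 days.
WINDOWS = {
    "year": timedelta(days=365),
    "quarter": timedelta(days=91),
    "28 days": timedelta(days=28),
}


def allowed_downtime(availability: float, window: timedelta) -> timedelta:
    """Return how much of `window` may be spent down at `availability`."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return window * (1.0 - availability)


for slo in (0.90, 0.99, 0.999, 0.9999, 0.99999):
    row = ", ".join(
        f"{name}: {allowed_downtime(slo, window)}"
        for name, window in WINDOWS.items()
    )
    print(f"{slo:.3%} -> {row}")
```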

So, if our strategy is to page a person and have them mitigate any issue in an hour, we might consider a 99.99% SLO, which allows roughly one such incident a year. The 5 minutes a year allowed by 99.999% is outside human ability to mitigate, so a strategy for that level should include something like canary automation. In this context, a small project involving a couple of people on a limited budget should probably consider a goal of 90% or 99% availability.

I found it helpful to walk through an example scenario. The Art of SLOs participant handbook provides several example “user journeys”. For example, I work on a free API developers consume in their apps. This fits the general description of a “user-facing serving system”, so availability, latency, and throughput are likely SLIs. Of these, most support requests concern availability and latency, so in the spirit of less is more, I’d focus on those.

I have an oncall rotation, pager automation (eg PagerDuty), and canary automation, but I’m also building on a service with 99% availability.

We can reasonably respond to pages in 30 minutes, and fail out of problematic regions within 30 minutes after that, but we also have occasional capacity issues which can take a few hours to resolve.

So, 99% seems like a reasonable availability SLO.
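As a sanity check on that conclusion, here is a sketch that counts how many incidents fit in a yearly downtime budget. It assumes each paged incident costs the full hour (30 minutes to respond plus 30 to fail out) and that a capacity issue costs three hours; the three-hour figure is my stand-in for “a few hours”, and the function name is hypothetical.

```python
from datetime import timedelta

YEAR = timedelta(days=365)  # assumption: budget measured over a 365-day year


def incidents_within_budget(
    availability: float,
    time_per_incident: timedelta,
    window: timedelta = YEAR,
) -> int:
    """Count whole incidents of a given length that fit in the downtime budget."""
    budget = window * (1.0 - availability)
    return budget // time_per_incident


paged_incident = timedelta(minutes=30) + timedelta(minutes=30)
capacity_incident = timedelta(hours=3)  # assumption standing in for "a few hours"

for slo in (0.99, 0.999, 0.9999):
    paged = incidents_within_budget(slo, paged_incident)
    capacity = incidents_within_budget(slo, capacity_incident)
    print(f"{slo:.2%}: {paged} paged incidents or {capacity} capacity incidents per year")
```

Under these assumptions 99% leaves room for dozens of incidents a year, while 99.99% does not cover even a single one-hour incident.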

A latency SLI seems more straightforward to me, perhaps because it can be directly measured in a running system. One guideline that comes to mind is the perception of immediacy for events that take less than 100ms.
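One way to turn that guideline into a measurement is to track the fraction of requests that finish in under 100ms. This is only a sketch: the sample latencies are invented, the function name is hypothetical, and treating an empty sample as fully within the threshold is my own assumption.

```python
def fraction_within(latencies_ms: list[float], threshold_ms: float = 100.0) -> float:
    """Latency SLI: the fraction of requests faster than `threshold_ms`."""
    if not latencies_ms:
        # Assumption: no requests means nothing was slow.
        return 1.0
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


# Invented request latencies in milliseconds.
sample = [20, 35, 80, 95, 120, 60, 45, 250, 70, 90]
print(f"{fraction_within(sample):.0%} of requests felt immediate")
```

A latency SLO would then be a target for this fraction, for example a chosen percentage of requests under 100ms.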

Written by Erik

December 21, 2019 at 10:35 am

Posted in book, tool

Deep Work by Cal Newport 📖

I’ve found I need uninterrupted time to focus for software projects at work, and actually leave work unsatisfied if I’m unable to find this time. While investigating what this is all about, I came across Newport’s Deep Work.

The first half of the book makes a case for investing in focus time, especially for folks performing “knowledge work”, which can require workers to hold many details in mind to produce significant output. The second half of the book provides practical recommendations for how to set aside uninterrupted time.

Focus time

As in Paul Graham’s Maker’s Schedule, Manager’s Schedule, the first recommendation is to explicitly identify and reserve time for focused work, which I found validating.

In practice, I find a couple of common sources of distraction:

  • random requests and discussions via email, chat, etc
  • random requests and discussions around my desk area

The book recommends avoiding interruptive electronic communications, especially social media, when trying to focus, and makes a case that most interruptions can wait.

Regarding in-person interruptions, the author mentions sequestering himself in a library. I might try something similar.

One challenge I see: focus requirements might be dependent on when as well as what. For example, the early days of a project might be meeting-heavy as requirements are clarified, and then focus-heavy as pieces are built.

My current experiment is to reserve 8-10 and 2-5 for focus work. During these times, I’ll sit away from my desk and ignore chat and email. This leaves a couple of hours before and after lunch to catch up on electronic communication and have spontaneous discussions, and provides a reasonable window for scheduling meetings.

The book cites research indicating our ability to work deeply maxes out around four hours, which also aligns with my experience. I previously experimented with reserving 1-5, but found it was too restrictive for natural interactions with colleagues, and I didn’t need the entire time anyway.

In an effort to provide “ownership” for my project, I took pains to be aware of all changes and support requests. This obviously doesn’t scale, but I appreciated reading a supportive quote from Tim Ferriss:

“Develop the habit of letting small bad things happen. If you don’t, you’ll never find time for the life-changing things.”

Productivity

An interesting aspect of this book is that it’s not simply presenting thoughts on the topic of deep work; it was motivated by the author’s need to optimize productivity. In this context, the book introduced me to 4DX (The 4 Disciplines of Execution), which is like agile boiled down to four points. As with agile, I find myself referencing it often.

In particular, I appreciate the emphasis on prioritization; during focused effort, we say “no” to many things so we can say “yes” to the most important things.

The book also mentions the idea of “results-driven reporting”, which defers meetings until there’s something new to report, as an alternative to status updates. This aligns with recommendations to prioritize requirements and blockers over status updates in cadence meetings from a recent talk on politically-charged projects.

Written by Erik

October 6, 2019 at 6:41 pm

Posted in book