MLCC: Gradient descent

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].

Earlier, I explored simplistic linear regression, largely based on [1]. The next section of the crash course ([2]) dives into “gradient descent” (GD), which raises the question “What’s wrong with the linear regression we just learned?” In short, the technique we just learned, Ordinary Least Squares (OLS), does not scale.

[3] clarifies that linear regression can take a few forms depending on input and processing constraints. Among these forms, OLS concerns one or more inputs where “all of the data must be available and you must have enough memory to fit the data and perform matrix operations” and uses least squares to find the best line. GD concerns “a very large dataset either in the number of rows or the number of columns that may not fit into memory.” As described by [4], OLS doesn’t scale. GD scales by finding a “numerical approximation … by iterative method”.

[2] introduces GD by descending a parabola, but it’s unclear how we transitioned from talking about straight lines in [1] to parabolas. The distinction is that we’re now focusing on loss functions. (To be fair, in retrospect, the title is “Reducing loss”🤦‍♂️) [2] asserts “For the kind of regression problems we’ve been examining, the resulting plot of loss vs. w1 will always be convex”, ie a parabola. OLS takes all the data and computes an optimal line, but GD iteratively generates lines and determines whether one is optimal by comparing the loss to the previous iteration.

[1] introduced the idea of quantifying the accuracy of a regression by calculating the loss. For example, it mentioned Mean Squared Error as a common loss function. [5] clarifies that Mean Squared Error squares each error, which makes it a quadratic function of the weight, hence the parabola. This provides helpful context for [2]’s definition of “gradient” as the derivative of the loss function.
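To convince myself where the parabola comes from, here is a minimal sketch that computes Mean Squared Error and its derivative for a one-weight model y' = w * x. The data points and the function names `mse` and `mse_gradient` are made up for illustration; they are not from [2] or [5].

```python
def mse(w, xs, ys):
    """Mean squared error of the model y' = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Derivative of MSE with respect to w: mean of 2 * x * (w * x - y)."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data generated by y = 2x, so the loss is lowest at w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for w in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"w={w}: loss={mse(w, xs, ys):.2f}, gradient={mse_gradient(w, xs, ys):.2f}")
```

The loss is symmetric around w = 2, and the gradient is negative to the left of the minimum and positive to the right, which is the bowl shape [2] describes.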

I like the summary statement from [5]

The goal of any Machine Learning Algorithm is to minimize the Cost Function

[5] uses the interactive exercise from [2]. It’s reassuring to see convergence 😉

[4] presents a good example of a team trying to find the highest peak in a mountainous area by parachuting randomly over the range and reporting their local max daily. I can see how that would scale well for a large data set. Reminds me of MapReduce.

This example is a bit counter-intuitive, though, in that GD is trying to find a minimum (loss) rather than a maximum. It’d be better phrased as trying to find the deepest valley. Anyway, it states “Our aim is to reach the minima which is the valley bottom. So our gradient should be negative always … So if at our initial weights, the slope is negative, we are in the right direction”, which explains the “descent” in “gradient descent”.

[4] (like [2]) describes three forms of GD:

  1. Batch
  2. Stochastic
  3. Mini Batch

[2] defines “a batch” as “the total number of examples you use to calculate the gradient in a single iteration.” Presumably, it’s referring to Batch GD when it says “So far, we’ve assumed that the batch has been the entire data set.”

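[2] describes Stochastic GD (SGD) as using a single example, picked at random, for each iteration. Given enough iterations this works, but each step is very noisy. At the other extreme, [2] notes that a full batch over a large data set is slow to compute and likely to contain redundant data.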

[2] states Mini Batch “reduces the amount of noise in SGD but is still more efficient than full-batch” because it uses batches of 10-1000 random examples, and that Mini Batch is what’s used in practice.
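As I understand it, the three forms differ only in how many examples feed each gradient calculation. A rough sketch, assuming examples are (feature, label) pairs; the helper `sample_batch`, the data, and the batch size of 32 are my own choices for illustration, not from [2].

```python
import random


def sample_batch(examples, batch_size, rng):
    """Pick batch_size examples at random, without replacement."""
    return rng.sample(examples, batch_size)


# Hypothetical labeled examples: (feature, label) pairs following y = 2x.
examples = [(float(x), 2.0 * x) for x in range(1, 1001)]
rng = random.Random(0)  # fixed seed so runs are repeatable

full_batch = examples                          # batch GD: every example
sgd_batch = sample_batch(examples, 1, rng)     # stochastic GD: one example
mini_batch = sample_batch(examples, 32, rng)   # mini-batch GD: 10-1000 examples

print(len(full_batch), len(sgd_batch), len(mini_batch))
```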

When do we stop iterating? [2] states “you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.”

To summarize:

  1. Initialize with arbitrary weights
  2. Generate a model
  3. Sample (labeled) examples
  4. Input sample into the model
  5. Calculate the loss
  6. Compare the new loss with the previous loss
  7. If loss is decreasing
    1. Move the weight by a step in the direction that reduces loss, ie opposite the gradient
    2. Repeat from step 2
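The steps above can be sketched as a short loop. This is a minimal version under my own assumptions: a one-weight model y' = w * x with no bias, the full batch on every iteration, and Mean Squared Error as the loss. The function name, learning rate, tolerance, and data are hypothetical.

```python
def gradient_descent(xs, ys, learning_rate=0.05, tolerance=1e-9, max_steps=10_000):
    """Fit y' = w * x by full-batch gradient descent on mean squared error."""
    n = len(xs)
    w = 0.0  # step 1: arbitrary starting weight
    previous_loss = float("inf")
    for step in range(max_steps):
        errors = [w * x - y for x, y in zip(xs, ys)]   # steps 2-4: predict
        loss = sum(e ** 2 for e in errors) / n          # step 5: MSE
        if previous_loss - loss < tolerance:            # step 6: converged?
            break
        gradient = sum(2 * e * x for e, x in zip(errors, xs)) / n
        w -= learning_rate * gradient                   # step 7: move downhill
        previous_loss = loss
    return w, loss, step


# Made-up data that follows y = 2x, so w should approach 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w, loss, steps = gradient_descent(xs, ys)
print(f"w={w:.4f} loss={loss:.2e} steps={steps}")
```

The loop stops once the loss improves by less than `tolerance`, which is one way to express “loss stops changing or at least changes extremely slowly”.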


  1. Google Machine Learning Crash Course: “Descending into ML”
  2. Google Machine Learning Crash Course: “Reducing loss”
  3. Machine Learning Mastery: “Linear Regression for Machine Learning”
  4. Towards Data Science: “Optimization: Ordinary Least Squares Vs. Gradient Descent — from scratch”
  5. Towards Data Science: “Understanding the Mathematics behind Gradient Descent”

MLCC: Linear regression

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [1] and [2].

A lot of ML quickstarts dive right into jargon like model, feature, y’, L2, etc, which makes it hard for me to learn the basics – “what are we doing and why?”

The crash course also presents some jargon, but at least explains each concept and links to a glossary, which makes it easier to learn.

After a few days of poking around, one piece of jargon seems irreducible: linear regression. In other words, this is the kind of basic ML concept I’ve been looking for. This is where I’d start if I was helping someone learn ML.

I probably learned about linear regression in the one statistics class I took in college, but have forgotten about it after years of string parsing 🙂

The glossary entry for linear regression describes it as “Using the raw output (y’) of a linear model as the actual prediction in a regression model”, which is still too dense for me.

The linear regression module of the crash course is closer to my level:

Linear regression is a method for finding the straight line … that best fits a set of points.

The crash course provides a good example of a line fitting points describing cricket chirps per minute per temperature:

Google's example of a line fitting cricket chirps by temperature

The “linear” in “linear regression” refers to this straight line, as in linear equation. The “regression” refers to “regression to the mean”, which is a statistical observation unfortunately unrelated to statistical methods like the least squares technique described below, as explained humorously by John Seymour.

Math is Fun describes a technique called “least squares regression” for finding such a line. Google’s glossary also has an entry for least squares regression, which gives me confidence that I’m bridging my level (Math is Fun) with the novel concept of ML.

Helpful tip from StatQuest’s “Machine Learning Fundamentals: Bias and Variance”: differences are squared so that negative distances don’t cancel out positive distances.
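A tiny example of why squaring matters, using made-up residuals (actual minus predicted) rather than anything from the video:

```python
# Hypothetical residuals (actual minus predicted) for a poorly fitting line.
residuals = [3.0, -3.0, 2.0, -2.0]

signed_total = sum(residuals)                   # positives and negatives cancel
squared_total = sum(r ** 2 for r in residuals)  # every miss counts

print(signed_total)   # 0.0
print(squared_total)  # 26.0
```

The signed total suggests a perfect fit even though every point misses the line; the squared total does not.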

Math is Fun’s article on linear equations and the crash course’s video on linear regression reminded me of the slope-intercept form of a linear equation I learned about way back when: y = mx + b.

The crash course even describes this equation as a “model”: “By convention in machine learning, you’ll write the equation for a model slightly differently …”

All this helps me understand in the most basic sense:

  • A “model” is just an equation
  • “Training” and “learning” are just performing a regression calculation to generate an equation
  • Performing these calculations regularly and on large data sets is tedious and error prone, so we use a computer, hence “machine learning”
  • “Prediction” and “inference” are just plugging x values into the equation
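To make the list above concrete, here is a minimal sketch in which “training” is a least squares fit of y = mx + b and “prediction” is plugging in x. The function names `train` and `predict` and the data points are my own, chosen for illustration; the fit uses the standard least squares formulas for a single input.

```python
def train(xs, ys):
    """'Training': least squares fit of y = m * x + b to the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b


def predict(model, x):
    """'Prediction': plug x into the equation."""
    m, b = model
    return m * x + b


# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = train(xs, ys)   # the "model" is just the pair (m, b)
print(model)
print(predict(model, 5.0))
```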


  1. Google Machine Learning Crash Course: “Framing”
  2. Google Machine Learning Crash Course: “Descending into ML”

The Manager’s Path

I’m an individual contributor, but I want to better understand management’s concerns, so I’m reading Camille Fournier’s excellent The Manager’s Path. These are my notes.

Many think a neutral relationship with management is good because at least it’s not negative, but there is such a thing as a positive relationship w mgmt.

1-1 mtngs:

  • two purposes:
    • Human connection
    • Private conversation, eg feedback
  • I agree w the above two, and would add a third:
    • To ensure time w otherwise busy ppl; the junior person has the priority
  • Not for status
  • Prepare an agenda. I like a living doc, linked from the mtng invite. Items can be added and referenced any time
  • “Regular 1-1s are like oil changes; if you skip them, plan to get stranded …”
  • “Try to keep notes in a shared document” 👍 I like to link an agenda doc from the mtng invite. (Same for most recurring mtngs.)

As you become more senior, feedback decreases.

Appreciate the fact that current peers turn into future jobs.

– common every 5-10 yrs
– lots of uncertainty in the world
– ultimately, we have to rely on ourselves

People aren’t good at saying what they mean in a way others can understand, so we have to listen carefully to words, and non-verbal cues indicating the person feels understood.

“Be prepared to say anything complex a few times and in different ways.” I’ve found such repetition frustrating in the past. It’s validating to see this advice.

Effective teams have good onboarding documents. Have new hires update the docs as their initial contribution.

“What you measure, you improve.”

Beware alpha-geek tendencies. In particular, the tendency to lecture and debate.

Mentorship skills:
– keep an open mind, since the mentee brings fresh eyes
– listen and speak their language. If you can’t hear the question being asked, you can’t provide good answers
– use the mentorship to build your network

“Tech lead is not a job for the person who wants the freedom to focus deeply on the details of her own code.”

“… the tech lead role may be held by many different stages of engineer, and may be passed from one engineer to another without either person necessarily changing his functional job level.”

“… we know from the title that it is expected to be both a technical position and a leadership role.” In other words, it’s not necessarily superlative, ie TL != best.

“The tech lead is learning to be a strong technical project manager… and [is] learning how to handle difficult management and leadership situations”

“Realistically, it is very hard to grow past senior engineer 2 without ever having acted as a tech lead, even on the individual contributor track… people skills are what we’re asking the new tech lead to stretch, more than pure technical expertise.” This stands out to me because of the tension between manager and maker modes, to use Paul Graham’s terminology.

“Being a tech lead is an exercise in influencing without authority …” Including building a psychological skill set for managing associated stresses.

“From now on … balancing is likely to be one of your core challenges.”

Currently, it feels like I’m working two jobs, manager mode during the day, and maker mode in morning and evenings. Regular, project-specific “cadence” meetings have helped reduce ad hoc discussions, fwiw.

Ah, a few lines later: “Some days you’re on maker’s schedule, and some days you’re on manager’s schedule…It’s very difficult to get into the groove of writing code if you’re interrupted every hour by a meeting.”

“Part of your leadership is helping the other stakeholders … respect the team’s focus and set up meeting calendars that are not overwhelming for individual contributors.” I’m very happy to see this in a book about managing thought workers.

Main roles of tech lead

  • Architect and business analyst. Design the system enough to provide estimates and ensure requirements are met
  • Project planner. The goal is to maximize parallelization
  • Developer and lead. Write code, but not too much. The goal is the project (and team development), not individual tasks

“Sometimes tech leads are tempted to go to heroics and push through obstacles themselves… [but] you should communicate the obstacle first.” I can relate with the former and appreciate the actionable latter.

“Teams often fail because they overworked themselves on a feature their product manager would have been willing to compromise on.” So, communicate.

“… most managers will expect their tech leads to continue writing as much code as before … It’s generally a pure increase in responsibility …”

The goal of a project plan is a “degree of forethought, in places where you can reasonably make predictions and plan … The plan itself … is less important than the act of planning.”

Take time to explain. No one who’s not actively working on a project should be expected to immediately know and understand project details.

Do a premortem as part of project planning. How could the system fail, and what could we do to recover?

“Having the focus to build something big yourself is a distant memory.”

The agile principles can be a healthy alternative to rigid process 👍 I think they’re great.

“… no two great teams ever look exactly alike in process, tools or work style” The best thing I’ve seen is an appreciation of experimentation and iteratively building a style that works for the current team. A basic project plan, ie list of tasks, also seems like a universal business requirement. Put another way, revisiting that plan periodically seems like a reasonable, universal starting point.

Qualities of a great tech lead:

  • Understand the architecture
  • Help build, but involve others
  • Lead decisions, but do so collaboratively
  • Communicate

“You want to encourage others on your team to learn the entire system … but you don’t always need to be self-sacrificing” There’s the need for a sense of balance again.

“Your productivity is now less important than the productivity of the whole team.” But how to improve the productivity of the team without putting on a management hat? Fournier gives an example: “Represent the team in meetings.”

Possession of communication skills differentiates successful leaders.

“Practice repeating things back to people to ensure you understand them.” I like this! I think it pairs well w advice earlier in the book to listen and observe non-verbal cues.

Communicate and listen.

I’d add that the tech lead label can also make one a focal point for questions, eg support, which can disrupt focus work. I like the pattern of having a support rotation, but depending on the company, the convention may be to simply ping the TL.

“Respect the ‘maker schedule’ for reports” 👍 As a general rule, I appreciate biasing toward contiguous meeting blocks.

“Autonomy … is an important element of motivation.” I see this w external contributions too. Maximizing an integrating team’s autonomy frees them to meet their goals w minimal bottlenecks.

Deep Work by Cal Newport 📖

I’ve found I need uninterrupted time to focus for software projects at work, and actually leave work unsatisfied if I’m unable to find this time. While investigating what this is all about, I came across Newport’s Deep Work.

The first half of the book makes a case for investing in focus time, especially for folks performing “knowledge work”, which can require workers to hold many details in mind to produce significant output. The second half of the book provides practical recommendations for how to set aside uninterrupted time.

Focus time

Like Paul Graham’s Maker’s Schedule, Manager’s Schedule, the first recommendation is to explicitly identify and reserve time for focused work, which I found validating.

In practice, I find a couple common sources of distractions:

  • random requests and discussions via email, chat, etc
  • random requests and discussions around my desk area

The book recommends avoiding interruptive electronic communications, especially social media, when trying to focus, and makes a case that most interruptions can wait.

Regarding in-person interruptions, the author mentions sequestering himself in a library. I might try something similar.

One challenge I see: focus requirements might be dependent on when as well as what. For example, the early days of a project might be meeting-heavy as requirements are clarified, and then focus-heavy as pieces are built.

My current experiment is to reserve 8-10 and 2-5 for focus work. During these times, I’ll sit away from my desk and ignore chat and email. This leaves a couple hours before and after lunch to catch up on electronic communication, have spontaneous discussions and provide a reasonable window for scheduling meetings.

The book cites research indicating our ability to work deeply maxes out around four hours, which also aligns with my experience. I previously experimented with reserving 1-5, but found it was too restrictive for natural interactions with colleagues, and I didn’t need the entire time anyway.

In an effort to provide “ownership” for my project, I took pains to be aware of all changes and support requests. This obviously doesn’t scale, but I appreciated reading a supportive quote from Tim Ferriss:

Develop the habit of letting small bad things happen. If you don’t, you’ll never find time for the life-changing things.


An interesting aspect of this book is it’s not simply presenting thoughts on the topic of deep work; it was motivated by the author’s need to optimize productivity. In this context, the book introduced me to 4DX, which is like agile boiled down to four points. As with agile, I find myself referencing it often.

In particular, I appreciate the emphasis on prioritization; during focused effort, we say “no” to many things so we can say “yes” to the most important things.

The book also mentions the idea of “results-driven reporting”, which defers meetings until there’s something new to report, as an alternative to status updates. This aligns with recommendations to prioritize requirements and blockers over status updates in cadence meetings from a recent talk on politically-charged projects.

Politically-charged projects

I attended a talk yesterday that shared best-practices from managing two large projects that suffered from competing priorities:

  • prioritize, prioritize, prioritize <– reminds me of 4DX
  • put goals, tasks, vocabulary, agreements, engineering guidance, etc in writing to clarify communication
  • use pilot programs to clarify requirements
  • hold regular cadence meetings <– reminds me of 4DX
  • focus cadence meetings on clarifying requirements and getting help more than status updates
  • have periodic summits to build team cohesion
  • invite folks from different teams and with different roles to the summits to get a diversity of perspectives
  • identify and support “influencers” in the teams you need help from

One of the problems involved signing a contract before performing any engineering feasibility analysis and then having to turn the org on a dime to meet the deadline. 🦶🔫 I have experience being on the other side of abrupt changes and wondering what was going on, so it was validating to hear the context.

Mobile Growth meetup 9/21/17


Branch runs a nice mobile growth meetup I’ve attended a couple times. The one last night was in the Microsoft office (formerly Yammer) in the Twitter building.

Credit to Prakhar for asking questions about these notes that led to more clarifying detail.



What’s worked?

  • Using a wait-list to alleviate cold start; complete profile to advance in list
  • Providing VIPs with promo urls that point at their profile. This drives downloads and enables warm signup
  • Targeting individuals for trip-appropriate travel ads based on their check-out dates
  • Providing teen demographic with feedback features, eg "likes", increased retention by 5%

How to get users?

  • Maximize free, organic stuff first, as opposed to buying keywords, then layer "marketing mix" (paid marketing channels) on top (to get "halo effect"), ie pr > ads
  • All news is good news in early days. Being exclusive is ok. People complaining is ok
  • Facebook ads accounted for 20% traffic
  • Have 2-3 marketing channels to account for fluctuating performance. Continuously try new channels

What didn’t work?

  • Test performance of pics on app store listing
  • Celebrities are well known, so using their pics is tempting, but usage without permission implies endorsement and they may take action

Reengagement & overlooked metric?

  • App quality [1]
  • Minimize registration requirements. How much info can you capture later? Reducing one field can have a big impact. Prioritize input hints and assistance before paid marketing
  • Try requesting push earlier; not first, but not last, eg so you can push "We didn’t mean $3.99. We meant $2.99"
  • Ask for easy things first, which will help people feel invested and more likely to grant hard things later

Metrics to obsess over?

  • Product quality
  • Predictive churn
  • Make it hard to cancel, eg at least ask why

Thoughts on iOS 11?

  • In-app purchases process is better
  • Live photos, which are easier to produce than video and more compelling than still
  • Getting featured in app store is no longer make/break for business

Cause of FB acquisition performance change?

  • This was regarding "How to get users?" answer above mentioning performance fluctuation
  • Unclear, but timing corresponded with new FB interstitial when exiting app, eg to app store

Snapchat ads?

  • Too early to tell

Google’s UAC campaign?

  • One panelist didn’t use Google ads because FB CPI is lower

How to reengage users who don’t create account?

  • Low involvement indicates low intent and will be expensive to reengage
  • Request push earlier in registration
  • Collect retargeting info on app install and then use ads to drive registration completion

How to AB test frequently?

  • This was regarding Laughly’s two-week experiment cycle
  • Only test one thing at a time. Literally, only one variation in the app every two weeks (to reduce noise) [2]

Top recommendation?

  • Experiment & fail fast
  • Prioritize feature requests from users
  • Test new marketing channels
  • Acquisition & retention are the same


[1]: There wasn’t a specific metric mentioned. The general idea was: invest in app quality before driving traffic to app, ie if an app’s unusable, no amount of growth tuning will retain users.

[2]: This was my top takeaway. Presumably this also reduces engineering complexity and improves UX consistency. The pitch was purely about logical correctness in experiment construction, but the person who asked the question mentioned their experiments take months to run, which would seem to indicate a significance (or quality) concern. I also appreciated the conceptual simplicity. I suppose a follow-on requirement is to have a smaller eng org, so folks don’t feel blocked by limited release opportunities. In my experience, we tried to scale eng by running multiple experiments simultaneously, but the tech required to support this was complex, to the point where I’m now looking back and wondering if we should have just done less 🙂

Notes from Bill McNabb’s talk at Twitter 2017-07-27

Bill McNabb is CEO of Vanguard, which promotes gender diversity internally and at companies they own shares in. I thought the rationale he described was eloquent and applicable in general. Three points stood out.

First, the goal is corporate performance. Diversity provides material value.

Second, the probability of having all effective leaders in a group partitioned by gender is lower than in an unpartitioned group.

Third, research findings (example) support the hypothesis that a diverse board yields higher performance.

Notes from “Web Application Development with Closure Compiler” talk by Alan Leung on 6/22/11

Alan visited Twitter on 6/22 and presented an introductory talk on Google’s Closure compiler for JavaScript. Alan is the tech lead on the Closure team.

Here are the slides:


* JavaScript was originally designed for small DOM operations. Now that we’re building large-scale apps in JS, we can use some help.
* Google uses Closure for all but a couple products
* The Closure compiler can perform ~55 optimization passes, including linting code, validating function definitions, performing gzip-optimized compression, trimming dead branches
* Closure can also provide compile-time constants, e.g., “if(INTERNAL){…”, and trim unused branches that result
* Closure uses a graph coloring heuristic for variable renaming
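To get a feel for the graph coloring idea in the last point, here is a generic greedy coloring sketch in Python. It is not Closure's implementation, which I haven't read. The assumption is that two variables in use at the same time cannot share a name, while all others can, so more variables end up with the shortest names. The function name, the graph, and the variable names are hypothetical.

```python
def rename_variables(interference, short_names):
    """Greedy graph coloring: variables that are live at the same time
    (neighbors in the interference graph) must get different short names;
    everything else may share one."""
    assigned = {}
    # Color the most-constrained variables first (a common greedy heuristic).
    for var in sorted(interference, key=lambda v: (-len(interference[v]), v)):
        taken = {assigned[n] for n in interference[var] if n in assigned}
        assigned[var] = next(name for name in short_names if name not in taken)
    return assigned


# Hypothetical interference graph: an edge means the two variables are
# in use at the same time and so cannot share a name.
interference = {
    "userName": {"userId"},
    "userId": {"userName", "requestCount"},
    "requestCount": {"userId"},
    "tempBuffer": set(),
}

renamed = rename_variables(interference, ["a", "b", "c", "d"])
print(renamed)
```

In this example four variables fit into two short names, because only neighbors in the graph need distinct ones.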


Notes from Kyle Neath’s presentation at Twitter on 5/31

  • Slides
  • hashbang urls
    • are a kludgy workaround for lack of history api. Since history api is coming, they have no future. Since urls are forever, especially w/ tweets being stored in the lib of congress, use of hashbangs results in permanent support for a temporary condition.
    • break pre-existing url fragment behavior
    • result in confusing routing logic
  • “responsive web design” is adapting to client and seeming responsive to user input
    • page load isn’t just a benchmark; a page is only “loaded” when the user can scroll, read text, and click links
  • well-designed urls provide a command-line-like interface for web apps
  • all web assets should have a url, i.e., navigation should not allow access to a resource that cannot then be accessed directly via a url
  • native elements should behave as the user expects
    • do not modify common key combos, e.g., shift + click
    • take advantage of the back button, tabs, links, etc
  • responsiveness is as much about performance as perception
    • wait ~500ms before showing loader image; showing loaders immediately can actually make the page seem slower
  • ssl
    • is required now that there are common, easy ways to sniff credentials
    • a new ssl handshake is very slow, and required for each domain
    • use http keep-alive to reuse ssl connections
    • multiple parallel requests to a new domain will each have to perform a handshake; instead, complete one fast request, and then reuse the connection for subsequent parallel requests
    • github optimized its backend to 40ms latency before realizing that the ssl handshake takes 500ms
      • a case of perception > performance
      • favor science over theory, i.e., test time-to-usable in multiple regions instead of just running perf tests on components
  • templates
    • use something simple, e.g., mustache
    • avoid rendering on client and server; pick one
    • kneath prefers server-side
    • for server-side rendering, passing html back as one value in a json object allows for passing data back in other keys
  • html 5 history api
    • allows for much richer state management. See github’s new issues dashboard
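The templates tip above about passing html back as one value in a json object might look like the following minimal sketch. The function `render_response` and every key name, including "html", are my own guesses for illustration, not something shown in the talk.

```python
import json


def render_response(html, **extra):
    """Wrap server-rendered HTML in a JSON object alongside other data."""
    return json.dumps({"html": html, **extra})


# Hypothetical payload: the key names are illustrative only.
body = render_response("<li>New issue</li>", unread_count=3, next_page="/issues?page=2")

decoded = json.loads(body)
print(decoded["html"])          # markup the client inserts into the page
print(decoded["unread_count"])  # data the client can use separately
```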

Notes from Neil Gershenfeld’s 5/24 talk at Twitter

My mind was just blown by a talk from Neil Gershenfeld, director of the Bits and Atoms lab at MIT. His team created the fab lab. Here are some notes