The Post-Incident Review Issue 3: April 2020

Three ways to enjoy the Post-Incident Review! Table of Contents

Featured Illustration

Illustration by @DeniseYu21, story from @saucinonyouuuu.


Featured Post-Incident Report: GitHub

"February service disruptions post-incident analysis"
GitHub
Published March 26, 2020

In late February, GitHub experienced multiple service interruptions that resulted in degraded service for a total of eight hours and 14 minutes over four distinct events. Unexpected variations in database load, coupled with an unintended configuration issue introduced as a part of ongoing scaling improvements, led to resource contention in our mysql1 database cluster.

Background
Originally, all our MySQL data lived in a single database cluster. Over time, as mysql1 grew larger and busier, we split functionally grouped sets of tables into new clusters and created new clusters for new features. However, much of our core dataset still resides within that original cluster.

We’re constantly scaling our databases to handle the additional load driven by new users and new products. In this case, an unexpected variance in database load contributed to cluster degradations and unavailability.

View the full post-incident report on the GitHub blog.




What Does Fairness Mean for On-call Rotations?

By Jaime Woo

Over the winter, Emil and I took a road trip for a friend’s housewarming. The drive was five hours—not bad with two of us able to take the wheel. My leg was easy, with little traffic and clear skies. For Emil’s turn, not so much. Three hours into the drive home, we hit a snowstorm. Visibility was poor and the road became noticeably slippery. Conversation in the car stopped.

Driving is stressful and exhausting during a blizzard. When we saw a sign for an upcoming rest stop, I asked Emil if he wanted me to take over. We didn’t know if the storm would intensify, straining him further as the driver. And an emergency swap later would have to take place on the side of the road, which is much less safe than at a rest stop.

Depending on the task, there are many ways to divide labour. For a road trip, we expected to spend roughly equal amounts of time driving. (It’s a stronger metric than distance: after all, if we got stuck in a long stretch of traffic, we’d probably still have swapped.) Technically, if Emil took my offer to take over, I would then end up with more total driving time. However, my time had been mostly smooth sailing, obviously lower impact than a winter storm.

When reflecting on how to characterize burden, it dawned upon us that many teams go through a similar calculus for incident response when deciding how to fairly assign on-call rotations. The most common method is to base rotations on fixed periods of time, but as even our simple example on the road highlights, is organizing shifts solely based on time the strongest strategy? We think factoring in the experience of on-call needs to be more rigorous and front-and-center.

Yes, there are reasons for basing on-call schedules on equal lengths of time. The regularity provides structure and certainty. Carrying a laptop everywhere you go is inconvenient, and having a set schedule means you can know when to book a medical appointment, or to go on vacation. But the need for certainty doesn’t only apply to knowing if you’re on-call or not.

We don’t know when interruptions will happen, and their random nature means that you can’t guarantee an even distribution of load across a team. When schedules are inflexible, some people can end up taking the brunt. A schedule that equates fairness with time on shift can’t take that into account. One person could consistently encounter empty roads on a sunny day, while another tackles aggressive drivers during a hailstorm.

A hybrid approach that acknowledges the number of alerts and interruptions as well as time spent on-call would allow schedules to be fairer, and lead to healthier, more sustainable rotations. This might mean a cap on incidents, as Google does, allowing only two incidents per 12-hour shift. Or, it may make sense that after someone has faced a disproportionate burden of interruptions, their next shift is rescheduled until they’ve had enough time to recover. Another alternative is to ensure there are always primary and secondary on-calls, and that the two people swap roles to average out the labour.
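To make the hybrid idea concrete, here is a minimal sketch of how a team might score shifts by both time and interruptions. Everything in it is an assumption for illustration: the `Shift` record, the `HOURS_PER_INCIDENT` weight, the names `alex` and `sam`, and the rule of handing the next shift to whoever carries the lowest load. The two-incident cap is the only number taken from the example above, and it would need tuning for any real rotation.

```python
from dataclasses import dataclass

# Cap from the example above: two incidents per 12-hour shift.
MAX_INCIDENTS_PER_12H = 2
# Hypothetical weight: how many quiet on-call hours one incident "costs".
HOURS_PER_INCIDENT = 4.0


@dataclass
class Shift:
    engineer: str   # hypothetical identifier for the person on call
    hours: float    # length of the shift in hours
    incidents: int  # interruptions handled during the shift


def shift_load(shift: Shift) -> float:
    """Blend time on call with interruptions into a single load score."""
    return shift.hours + HOURS_PER_INCIDENT * shift.incidents


def over_cap(shift: Shift) -> bool:
    """True if the shift exceeded the incident cap, scaled to its length."""
    allowed = MAX_INCIDENTS_PER_12H * (shift.hours / 12.0)
    return shift.incidents > allowed


def total_load(shifts: list[Shift]) -> dict[str, float]:
    """Sum the load score of every shift, per engineer."""
    totals: dict[str, float] = {}
    for s in shifts:
        totals[s.engineer] = totals.get(s.engineer, 0.0) + shift_load(s)
    return totals


def next_on_call(shifts: list[Shift]) -> str:
    """Pick the engineer with the lowest load so far (ties broken by name)."""
    totals = total_load(shifts)
    return min(sorted(totals), key=totals.get)


if __name__ == "__main__":
    history = [Shift("alex", 12, 5), Shift("sam", 12, 0), Shift("sam", 12, 1)]
    print(total_load(history))
    print([s.engineer for s in history if over_cap(s)])
    print(next_on_call(history))
```

In this sketch, one rough shift can outweigh two quiet ones, which is the point: the person who drove through the snowstorm is not automatically next in line just because they logged fewer hours.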

The biggest challenge may well be cultural. On-call for many is like going to the dentist: something that needs to be done, but not to be thought of until the very last minute possible. (That also often means not wanting to change something if it doesn’t seem too broken.) Since the most visible limitation to on-call is where you can go, time-based rotations appear to be the solution. That unfortunately allows the elephant in the room to go unnoticed: the stress that’s involved with being on-call.

Take the current situation: most people are working from home, so they are always near their laptops with internet connectivity. And yet, physical accessibility isn’t the only thing that matters. During this pandemic, all of our stress levels are elevated, and little things seem to set us off much more easily. Naturally, handling an incident suddenly takes a larger toll, and waiting a whole week to see if one or more strikes seems unfairly long. How do we ensure that when a teammate is driving through a snowstorm, we allow an easy swap?

Now is the time to experiment with how we approach scheduling on-call, because the physical limitations are temporarily gone. We are all home, and that means swapping between on-call shifts is in some ways easier. If Emil had been unable to drive, it’s obvious that my taking over would have been best for us both. Even if it only meant giving Emil a breather, I would have taken over.

In the end, Emil felt determined to finish the drive (we’ll save a discussion around hero culture for another time) and we safely arrived home a few hours later. Yet even the offer to swap injected flexibility into the system. This is the time to adopt habits for healthier, fairer on-call. We shouldn’t be afraid to ask each other how we’re handling the road, and if the driver needs a break. In the end, we’re all in this journey together.



We’re dedicated to improving the on-call experience. Join the private beta for Ovvy Insights, our monitoring tool for on-call, at https://ovvy.io



The Incident Labs Bookshelf

“Dreyer’s English” by Benjamin Dreyer
Random House, 2019

Why do you want to read this?
If you’re involved in post-incident reviews, you’re likely writing. Writing can be intimidating because of the many rules and stylistic quirks involved, but it needn’t be.

Ben’s book is a fantastic, enjoyable read—there are many literal laugh-out-loud moments—that will not only erase traumatic memories from English class, but will help you understand how to deploy words in a powerful, meaningful manner.

- Emil Stolarsky and Jaime Woo

Let us know what you think of Dreyer’s English on Twitter by tagging us @IncidentLabsInc




Enjoy this issue? Check out the entire catalogue.

Subscribe

Interested in receiving the next issue direct to your inbox? Enter your email here.


Enjoyed the Post-Incident Review? Check out our other projects!

Printable version and bulk orders

Print the PDF double-sided yourself! Or, request bulk orders if you’d like to pick up enough copies for your team or org by shooting us an email at zine@incidentlabs.io!