The Post-Incident Review Issue 4: May 2020

Three ways to enjoy the Post-Incident Review! Table of Contents

Featured Illustration

Illustration by @DeniseYu21, story from @karlhigley. (It also has happened to Emil. :P)


Featured Post-Incident Report: Bose

"Bose QC 35 Firmware 4.5.2 Noise Cancellation Investigation Report"
Bose
Published April 2, 2020

Editors’ note: We normally share post-incident reports from online services, but nerded out over the Bose report. Here's a summary, plus a link to the full investigation report.

The investigation began after Bose released a firmware upgrade for its QC 35 headphones and users started writing in forums that it seemed to affect the noise cancellation feature. This took Bose by surprise because that feature wasn’t changed in the firmware update.

Bose started with an initial engineering evaluation: a deep dive into the firmware, plus sampling and testing of customer headphones. Afterwards, they ran an extended engineering evaluation, which included third-party testing, in-home visits, and engineering analysis of returned headphones.

Even though they found no evidence of the firmware affecting the noise cancellation feature, the write-up is an interesting breakdown of their process.

The PDF versions of the zine excerpt the extended engineering evaluation, and you can view the full post-incident report at the Bose community site.




How does your organization evaluate on-call?

You’re reading this zine because you are as interested in incident response as we are. You may have noticed that on-call is too often treated as “good enough,” but is that actually good enough?

We're running a survey to understand how organizations create feedback loops to improve their on-call process.

Survey closes June 5. Add your voice at https://ovvy.io/survey/may-2020-oncall :)




Why being curious pays off: the time lightning struck a Saturn V after launch

By Jaime Woo

Something’s going wrong.

The Incident Commander is sounding the alarm. They need eyes. Is this the worst case scenario we’ve been dreading? All the monitoring data is gone, and the graphs are showing an unintelligible pattern.

You glance at the pattern, and for some reason it clicks in your brain. A year ago, you’d been observing a test run and noticed the same seemingly indecipherable pattern. Back then, you figured why not play around with it to discover how to recreate it.

You send the IC an obscure suggestion. It’s so weird that at first the IC replies with, “What?!” But you repeat it, sure of what you are seeing. Time is ticking, and there are no other viable options, so the IC backs your suggestion.

Immediately, everything returns to normal. The mission to the moon continues as planned.

This isn’t a story from a tech company in 2020. This is Apollo 12. It was 1969 at NASA, when lightning struck the rising Saturn V twice: 36.5 seconds into liftoff and then again at 52 seconds, hitting the spacecraft and causing pandemonium. The lightning had caused a power surge and inadvertently disconnected the fuel cells, leading to a voltage drop.

NASA engineer John Aaron gave the obscure suggestion to “try SCE to Aux,” after recognizing the telemetry pattern from an anomaly he’d witnessed a year before during a test at the Kennedy Space Center. Shifting the signal conditioning electronics, or SCE, system to its auxiliary setting allowed it to operate in low-voltage conditions, restoring the telemetry.

The relevant question is: what inspired John Aaron to dig around and uncover what caused that specific data signature? In an oral history with NASA, he credits a “natural curiosity with why things work and how they work.”

The lightning strike was a black swan event, and curiosity is one way to prepare for such events. (Side note: Laura Nolan has a great talk on considering black swan events in SRE.) “[W]hen lightning strikes Apollo 12, I mean, we had never simulated that before,” notes Aaron.

“Our simulators were not even sophisticated enough that if we had, would it have necessarily produced the exact signature that I saw,” he added. “So only just by your research and ‘what if’ and contemplation and thinking about things and try to think of all, do you prepare yourself for that kind of event.” Aaron’s curiosity helped NASA avoid having to choose from a list of unsavoury options.

Curiosity has been on Emil’s and my mind lately. In April, we ran a workshop on SRE fundamentals, and a question we wanted to tackle was what traits you might find in someone working in SRE.

We were reminded of a conversation with a friend in Dublin who shared how she was the type to keep asking why about the systems she worked with. That echoes John Aaron talking about how he always wanted to know how things around him worked, and not stopping until he had a deep understanding. It dawned on Emil and me that curiosity was something the SREs we knew had in common.

That willingness to learn makes sense, given the need to work with complex systems. The systems change constantly, and the role requires someone who wants to ask questions about how they work. That inquisitiveness means that, rather than seeing one specific part of the system as their domain, SREs wonder about all the parts of the system and how they function together.

But it’s not just the technical system. SREs need to be curious about people too, the socio- part of the sociotechnical system. Without that, you couldn’t bring different teams together to create meaningful SLOs. You couldn’t navigate personality types to properly respond to incidents. You’d be satisfied with just the five whys and miss out on uncovering the lessons to be learned post-incident.

You may wonder then, does everyone have curiosity? Of course. Curiosity is within all of us; unfortunately, many of us have had negative experiences that dull it. You may say: Yes, I was curious as a child, but I’ve lost it. Can that be regained? Absolutely. So, how can we foster more curiosity? Skimming the research turns up some consistent ideas.

Curious people ask questions, and are also active listeners. The first part is perhaps easier, but the power is in the second part of processing what people are telling you—and then asking insightful follow-up questions.

They aren’t afraid to be wrong in their journey. Curiosity isn’t about being perfect, because you can’t learn without making mistakes (as much as we may wish it were different).

They aren’t scared to admit that they don’t know something. Curiosity isn’t about worrying that someone might know more than you. People concerned about that don’t ask too many questions, in case they get proven wrong!

Finally, curious people also create space to practice curiosity. Just having it within isn’t enough. They allow and prepare themselves to be curious, even if in the past being curious hasn’t led to a positive response.

For Emil and me, this includes allowing almost any conversation between us to be paused if one (or, sometimes, both) of us has to head to Wikipedia because our curiosity got sparked by something the other said. The Apollo 12 story came from Emil, and led to at least an hour of me reading up on the incident.

Another example is that we often propose ideas with the opener of “So what if we…” and nine times out of ten the other person will respond with “Who cares? Let’s try it!” The zine you’re reading is a result of exactly that openness. “So what if we took post-incident reports and laid them out like fancy art journals?” Our software project Ovvy is another. “So what if we built tooling that helps sort pages better to reduce toil?”

Unfortunately, many organizations, intentionally or unintentionally, diminish curiosity. They get scared that indulging in curiosity will distract employees, or misrepresent expectations. That’s a shame, because research suggests there are many benefits, including more innovation, less conflict, and better communication. In a world dominated by knowledge work, where everything is becoming increasingly unknowable, boundless curiosity might just end up saving whatever our own version of Apollo 12 looks like.



Want more writing about site reliability engineering? Subscribe to our newsletter, SRE for Mere Mortals, at https://incidentlabs.io



Thank You

Emil Stolarsky and Jaime Woo

When we started the Post-Incident Review, we managed our expectations. Incident response is fascinating, especially if you include the preparation before and analysis afterwards.

The readership for Post-Incident Review is now regularly hitting four digits. It was completely unexpected, but it’s a testament to the power of believing we can improve, learn, grow, and make things better.

So thank you!




Fix On-call and Restore Your Engineering Capacity

Bad pages are like weeds: left untended, unactionable and noisy pages will overwhelm your on-call, just as weeds will take over your lawn.

We’re building Ovvy Insights because you should have the right data to improve on-call. Your team understands your system best, so we combine their expertise with PagerDuty data to show you which parts of your system to prioritize.

Reduce toil and regain engineering capacity for healthier, more productive teams. Head to https://ovvy.io for full details!




Enjoy this issue? Check out the entire catalogue.



Enjoyed the Post-Incident Review? Check out our other projects!

Printable version and bulk orders

Print the PDF double-sided yourself! Or, if you’d like enough copies for your team or org, request a bulk order by shooting us an email at zine@incidentlabs.io!