Learning From Incidents, as an SRE
Intersecting established communities can result in good questions!
I spent a few days in Denver at the inaugural LFIconf. I presented a talk on how Google learns from incidents. I’ve been thinking about the experience and watching the talks that I missed on the YouTube playlist.
A few highlights:
John Allspaw’s closing keynote, highlighting how the LFI process can work at a real company (exemplars are important!):
Dr. Pupulidy discussed how LFI worked in aviation safety and talked about the process of sensemaking, goal conflicts, and situational awareness in complex systems. Very compelling:
Dr. David Woods brought together a lot of history and resources that I’ve heard about but haven’t yet absorbed. This talk was a LOT; I think I understood 8% of the content. I am still thinking about it:
Jessica Kerr from Honeycomb gave a talk highlighting the virtues of flexibility over structure when it comes to incident reports (or postmortems). This challenged some of my thinking and I really appreciate it!
Pirmin Schuermann’s talk on becoming “The Resistance” was inspiring and fun:
And I have to plug my own talk. This is about the “programitization” of learning from incidents at Google, which is challenging at Google-scale.
Overall, it was a delight to delve into this new-to-me community and find out how it overlaps with my Google-SRE understanding of dealing with and learning from incidents.
At the end of the day, the line that really drove it home for me was:
Good job Jeli folks. That was a fun conference!