An interesting bit of news, tangentially related to distributed systems reliability. Or directly related, if you’re like me:

This is kinda a big deal. This is monitoring and alerting, for airplanes. Like, all of them (in the US at least). Big 777s, tiny Cessnas: all pilots and airlines use NOTAMs to understand the risks present on their routes.
Here’s an example NOTAM from my local airport:
NOTAM 3/2522: San Luis County Regional Airport (KSBP)
!FDC 3/2522 SBP SID SAN LUIS COUNTY RGNL, SAN LUIS OBISPO, CA. WYNNR FOUR DEPARTURE ... FELLOWS TRANSITION NA EXCEPT FOR AIRCRAFT EQUIPPED WITH SUITABLE RNAV SYSTEM WITH GPS, FLW VOR OUT OF SERVICE. 2301100335-2302170335EST CREATED: 10 Jan 2023 03:35:00 SOURCE: KDZZNAXX 1
The failure was not met by panic (as far as I can tell), but by cool communication from the FAA. This is not an accident; it is the result of years (decades!) of practice. My guess is that this tweet was pre-written in a playbook somewhere. The FAA loves playbooks (or checklists, whatever).
So, the outcome here is that monitoring was down and they needed to tell everyone that. How do you tell everyone that the thing that tells everyone things is not working? In this case, Twitter and the press. Maybe other methods. Knowing that your monitoring is not working, not trustworthy, or incomplete in some way is known as meta-monitoring. Even being aware that this is a thing is pretty important.
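For the software folks, here is roughly what meta-monitoring can look like in code. This is a minimal sketch of one common approach (a heartbeat, or "dead man's switch"), not a description of how the FAA does it. The names `HeartbeatWatchdog`, `check_and_notify`, and `notify` are all made up for illustration, and the assumption is that your monitoring system can call `beat()` on a regular schedule.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HeartbeatWatchdog:
    """Tracks when the monitoring system last checked in (hypothetical helper)."""

    max_silence_seconds: float
    clock: Callable[[], float] = time.monotonic  # injectable so it can be tested
    last_beat: Optional[float] = None

    def beat(self) -> None:
        """Called by the monitoring system every time it completes a cycle."""
        self.last_beat = self.clock()

    def monitoring_is_trustworthy(self) -> bool:
        """True only if a heartbeat arrived within the allowed silence window."""
        if self.last_beat is None:
            return False  # never heard from monitoring: do not trust it
        return (self.clock() - self.last_beat) <= self.max_silence_seconds


def check_and_notify(watchdog: HeartbeatWatchdog,
                     notify: Callable[[str], None]) -> bool:
    """Send the pre-written message if monitoring has gone quiet.

    `notify` is an assumption: it stands in for whatever out-of-band
    channel you have that does not depend on the monitoring system itself.
    """
    if watchdog.monitoring_is_trustworthy():
        return True
    notify("Monitoring has gone quiet; treat dashboards and alerts as unreliable.")
    return False
```

The important design choice is that `notify` goes out over a channel that does not share a failure mode with the thing being watched. For the FAA that was Twitter and the press; for you it might be a second paging provider or a status page hosted somewhere else.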
As a friend noted today in a Slack, “system you have never heard of that hasn’t failed in like 30 years failed today” — this is a great way to look at it.
Systems fail; you have to accept that. Even something that has been working for 30 years will fail. It will also (probably) recover. But for some epsilon of time, you need to keep moving without it.
What’s your plan? What do you assume will always be there? What happens when it’s not? Do you have a tweet pre-written?
Don’t worry, this makes as much sense to pilots as a PagerDuty alert from Prometheus does to you and me. 🤯