Generic Mitigations in Site Reliability Engineering
In incident response, every problem is unique. But that doesn’t mean that every problem requires a unique response. On the contrary, if you spend too much time trying to develop a bespoke manual response to every incident that you need to resolve, you’re unlikely to be able to fix anything very quickly.
This reality has spawned conversation (most notably in a recent article by Jennifer Mace, a Senior SRE at Google) around the topic of “generic mitigations.” A generic mitigation is an easy-to-implement fix that can solve a wide range of problems.
In addition to offering a practical strategy for minimizing mean time to recovery (MTTR), generic mitigations reflect the broader importance of taking a pragmatic, automated approach to incident response overall. They’re an example of the type of thinking that SREs and IT engineers need to embrace today to develop reliability management strategies that work in practice and not just in theory.
What’s a Generic Mitigation?
You can read Mace’s article for the full details, but in a nutshell, a generic mitigation is an action that can solve a wide variety of problems rather than just one specific kind of problem.
Mace points to application rollbacks as a common example of generic mitigations. Rolling back a release to a known-stable version is a simple way to solve a whole range of potential issues with your current release. Rather than having to spend time figuring out exactly what’s wrong with the release and how to address it, you just roll back to restore reliability.
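As a hedged illustration of why rollbacks automate so well, the decision can be reduced to a tiny, mechanical rule: if the current release’s error rate breaches a threshold, serve the last known-good version instead. The function name, version strings, and threshold below are invented for the sketch, not taken from Mace’s article or any particular deploy tool.

```python
# Illustrative sketch only: an automated rollback decision.
# The names and the 5% threshold are hypothetical assumptions.

ERROR_RATE_THRESHOLD = 0.05  # hypothetical SLO-derived error budget trip wire


def choose_version(current: str, last_known_good: str, error_rate: float) -> str:
    """Return the version that should be serving traffic right now."""
    if error_rate > ERROR_RATE_THRESHOLD:
        # Generic mitigation: roll back first, diagnose the bad release later.
        return last_known_good
    return current


# A healthy release keeps serving; an unhealthy one triggers rollback.
print(choose_version("v2.3.1", "v2.3.0", error_rate=0.01))  # -> v2.3.1
print(choose_version("v2.3.1", "v2.3.0", error_rate=0.20))  # -> v2.3.0
```

The point of the sketch is that nothing in the rule depends on *why* the release is failing, which is exactly what makes it generic.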
As Mace points out, actions like upsizing resource allocations, quarantining bad service instances, and blocking a user or host that is causing a problem are other good examples of generic mitigations.
In addition to being easy to perform, generic mitigations lend themselves well to automation. When a simple action solves a range of problems, you can typically set up a tool to perform that action for you in a reliable way, regardless of the specifics of the incident at hand.
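One hedged way to picture that automation is a small dispatcher that maps a coarse symptom category to one of a handful of generic mitigations. All of the category and action names below are made up for illustration; a real system would call deploy, autoscaling, or firewall APIs instead of returning strings.

```python
# Illustrative sketch: a dispatcher from symptom categories to generic
# mitigations. Categories, action names, and context keys are hypothetical.

def rollback(ctx):      return f"rolled back {ctx['service']} to {ctx['last_good']}"
def upsize(ctx):        return f"doubled replicas for {ctx['service']}"
def quarantine(ctx):    return f"drained instance {ctx['instance']}"
def block_client(ctx):  return f"blocked client {ctx['client']}"

GENERIC_MITIGATIONS = {
    "bad_release":    rollback,
    "overload":       upsize,
    "bad_instance":   quarantine,
    "abusive_client": block_client,
}


def mitigate(symptom: str, ctx: dict) -> str:
    """Apply the generic mitigation mapped to a symptom category, if any."""
    action = GENERIC_MITIGATIONS.get(symptom)
    if action is None:
        return "page a human: no generic mitigation applies"
    return action(ctx)


print(mitigate("overload", {"service": "checkout"}))
# -> doubled replicas for checkout
```

Because each action is blunt and symptom-level rather than cause-level, the same four entries cover a wide range of distinct underlying bugs.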
Are Generic Mitigations Too Automated?
In some ways, Mace’s arguments seem blatantly obvious. Of course SREs should solve problems as quickly as possible.
But on the other hand, it’s easy to feel like something is a little off with the idea that a generic mitigation – especially a fully automated one – is the best response to an incident. That’s probably because SREs and incident response teams are inundated with reminders about how important it is to “get to the root cause” of each issue, achieve “continuous improvement” so that the same type of incident doesn’t recur, and so on. They may also feel like their engineering skills are wasted if they rely heavily on automated tools to solve problems for them.
From this perspective, the idea that the best response is the simplest and bluntest fix may seem a bit counterintuitive. How are you supposed to understand the root cause and prevent future occurrences, after all, if you don’t carefully analyze each incident and develop a response plan customized to it? If you’re being paid six figures to manage reliability, why would you automate mitigations of complex problems? And aren’t quick fixes dirty fixes?
Well, not necessarily. There’s a solid argument to be made (and Mace makes it) that the number-one priority in incident response should be to minimize the amount of disruption that your users experience. Whichever response helps you minimize MTTR, then, is the best response. Gaining a deep understanding of an issue before you resolve it doesn’t win you any points if you could have resolved it faster via a generic mitigation.
It certainly doesn’t help you achieve your SLAs, either. The only thing that matters there is what you actually achieve, not how you achieve it. Good luck telling your users, “yes, we failed to uphold our SLA guarantees, but please know that it was only because we are so committed to manual root-cause analysis.” Your users don’t care about root-cause analysis. They only care about uptime and performance.
Plus, it’s not as if generic mitigations and careful analysis of an incident are mutually exclusive. You can, and should, perform the deep analysis you need to understand what went wrong after you use a generic mitigation to solve an issue. But you need not wait until you have that understanding to resolve the problem. Or, even better, you can rely on automated tools to perform root-cause analysis for you while a generic mitigation restores service.
Generic Mitigations and Site Reliability Management
This brings us to some larger points about how the concept of generic mitigations reinforces best practices in reliability management as a whole.
Traditionally, a lot of talk about incident management has focused on what you should theoretically do, not what most teams actually do. You should theoretically have carefully orchestrated playbooks for each and every type of incident that could ever possibly befall your systems. In reality, no one can achieve that.
You should theoretically triage incidents so that the ideal person responds each time. In reality, it’s pretty tricky to figure out who the ideal response candidate is until after the issue has been resolved.
In theory, you should make sure that your developers, SREs, and IT engineers all share equal ownership in the incident response process. In reality, it can be difficult to effectively integrate developers, largely because the language and tools of incident response have traditionally been foreign to them.
What I’m getting at is that a healthy incident response strategy is one that is rooted in reality and pragmatism, not theory and idealism. Your goal should be to keep your team as happy as reasonably possible by mitigating toil and automating communication across the team, even if you don’t always have the perfect response plan on hand. You should strive to automate the playbooks you have, rather than trying to develop a playbook for every possible incident that could theoretically occur. You should certainly perform root-cause analysis, but never at the expense of solving the incident in the most expedient way possible.
The conversation around generic mitigations is valuable as a reminder that we need to take a real-world approach to reliability management. Achieving perfection is great if you can do it without sacrificing other priorities. But you typically can’t, so you should instead focus on getting the most out of the incident response teams, tools, and strategies you actually have. That means automating whatever you can, and not beating yourself up when your processes fall short of perfection.