Best Practices for Operationalizing Generic Mitigations
The beauty of a generic mitigation is that it solves a wide variety of problems using a single solution. When executed smoothly, generic mitigations allow fast remediation without requiring a great deal of troubleshooting or incident analysis.
But the key words there are “when executed smoothly.” If you don’t have the right plan in place for operationalizing generic mitigations, they may not deliver the fast, simple remediations that they should.
Keep reading for tips on how to execute generic mitigations in a consistent, efficient, and reliable way.
What Is a Generic Mitigation?
As we explained in an earlier post, a generic mitigation is a simple resolution that can fix – or at least mitigate the impact of – a range of different types of problems.
An application rollback is a common example of a generic mitigation. Rollbacks stop a problem within a specific application release from impacting your users. On its own, a rollback doesn’t fix the underlying issue in the application code, but it will stop the disruption.
Problems in Operationalizing Generic Mitigations
While generic mitigations offer a simple and straightforward means of remediating incidents, they can also pose challenges:
- Unintended consequences: A generic mitigation may have side effects that you didn’t anticipate ahead of time. For instance, rolling back an application release could cause a problem for another application that depends on the one you roll back.
- Deployment delays: To deliver on their promise of fast, easy resolution, generic mitigations must be easy to deploy. But some generic mitigation processes are difficult to execute. For example, rolling back an application isn’t usually as simple as pressing a button; you must ensure that you fall back to an earlier release smoothly. Rollbacks become more complicated still if you use canary releases, because you have to work out which instances of your application to roll back and which to leave untouched.
- Forgotten post-mortems: As noted above, generic mitigations can stop a problem from affecting users, but they don’t always fix the root cause. It can therefore be tempting to apply the mitigation and move on, without taking the time to do a post-mortem that lets you find and fix the root cause of the issue.
- Team buy-in: When dealing with complex incidents, there may not be instant consensus across your team about whether or not to use a generic mitigation. Some engineers may think that it would be wiser to try a different, less blunt approach. They may argue that an application rollback is overkill, for instance, and that there is a better way to fix the problem. Getting everyone on board with the generic mitigation – and ensuring that no engineer applies it without consensus – may therefore be a challenge.
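To make the canary case from the deployment-delays point concrete, here is a minimal Python sketch of choosing which instances to roll back. The `Instance` type, the instance names, and the version strings are all hypothetical; a real deployment system would supply this inventory through its own API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical record of one running copy of the application."""
    name: str
    version: str


def instances_to_roll_back(instances, bad_version):
    """Return only the instances running the faulty release."""
    return [i for i in instances if i.version == bad_version]


# Illustrative fleet: two instances on the stable release, one canary.
fleet = [
    Instance("web-1", "2.3.0"),
    Instance("web-2", "2.3.0"),
    Instance("web-3", "2.4.0"),  # canary running the new release
]

targets = instances_to_roll_back(fleet, bad_version="2.4.0")
print([i.name for i in targets])  # prints ['web-3']
```

Only the canary is selected here, so the instances still on the earlier release are left untouched.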
How to Ensure Smooth Generic Mitigations
There are several best practices your team can employ to address these challenges and ensure that generic mitigations are used appropriately and efficiently.
Generic Mitigation Playbooks
Incident management playbooks and generic mitigations should go hand in hand. Playbooks spell out the processes to follow in response to certain incidents. Playbooks that define when and when not to use a generic mitigation make it easier to avoid contention within your team about whether to apply one in the midst of handling an incident.
Playbooks can also help ensure that you conduct post-mortems. If a generic mitigation followed by a post-mortem is part of your playbook, it’s harder to forget about the post-mortem.
And perhaps most importantly, playbooks can define the step-by-step operations that your team needs to perform to execute a generic mitigation smoothly, which guards against the delays that can happen when implementing a complex generic mitigation. They could even help automate the process.
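As a rough illustration of what an automated playbook could look like, the Python sketch below runs a list of named steps in order and halts at the first failure. The step names and the placeholder functions are assumptions made for the example; each placeholder stands in for a call to your own tooling.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus an action that reports success.
Step = Tuple[str, Callable[[], bool]]


def run_playbook(steps: List[Step]) -> List[str]:
    """Run steps in order, stop at the first failure, return a log."""
    log = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            log.append("halted: escalate to a human")
            break
    return log


# Hypothetical placeholders; real versions would call deployment,
# monitoring, and ticketing systems.
def notify_dependents() -> bool:
    return True


def roll_back_release() -> bool:
    return True


def verify_health() -> bool:
    return True


def schedule_post_mortem() -> bool:
    return True


ROLLBACK_PLAYBOOK: List[Step] = [
    ("notify owners of dependent services", notify_dependents),
    ("roll back to previous release", roll_back_release),
    ("verify service health", verify_health),
    ("schedule post-mortem", schedule_post_mortem),
]

for line in run_playbook(ROLLBACK_PLAYBOOK):
    print(line)
```

Making the post-mortem an explicit step in the list is one way to keep it from being forgotten once the mitigation has stopped the disruption.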
Policies for Generic Mitigations
Whether or not you have playbooks in place, you should define policies that spell out when and when not to use a generic mitigation. Playbooks can’t cover every type of incident that could ever happen, so higher-level guidelines are helpful for ensuring that generic mitigations are used appropriately.
Your guidelines could include criteria such as how many users an incident has to be affecting before a generic mitigation becomes permissible, or how long your team should spend considering other remediations before settling on a generic mitigation.
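Guidelines like these can be written down as a small, checkable policy. The Python sketch below is one possible encoding, not a recommended standard: the field names and threshold values are invented for illustration, and treating either condition as sufficient on its own is an assumption your team may not share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MitigationPolicy:
    """Hypothetical thresholds; pick values that suit your service."""
    min_affected_users: int           # impact needed to permit mitigation
    max_minutes_on_alternatives: int  # time budget for other remediations


def generic_mitigation_allowed(policy, affected_users, minutes_elapsed):
    """Allow the mitigation if impact is high or the time budget is spent.

    Assumption: either condition is enough by itself.
    """
    return (
        affected_users >= policy.min_affected_users
        or minutes_elapsed >= policy.max_minutes_on_alternatives
    )


policy = MitigationPolicy(min_affected_users=500, max_minutes_on_alternatives=15)

print(generic_mitigation_allowed(policy, affected_users=1000, minutes_elapsed=2))  # True
print(generic_mitigation_allowed(policy, affected_users=10, minutes_elapsed=5))    # False
```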
Collaborating on Mitigations
The best way to avoid unintended consequences in generic mitigations is to ensure that all team members are plugged into the process – including stakeholders who are not directly responsible for whichever system the generic mitigation affects.
By streamlining communication and collaboration, you help ensure that engineers who maintain other systems can chime in if the generic mitigation you are proposing may impact their systems in ways you haven’t foreseen. They may also be able to suggest alternative approaches or ways to mitigate the side effects of the generic mitigation.
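One way to work out whom to bring into the conversation is to walk a service dependency map before applying the mitigation. The Python sketch below assumes you maintain such a map; the service names and the `DEPENDS_ON` structure are hypothetical.

```python
# Hypothetical map: each service lists the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "catalog"],
    "payments": ["auth"],
    "catalog": ["auth"],
    "auth": [],
}


def affected_services(target, depends_on):
    """Return services that depend on target, directly or transitively."""
    affected = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for service, deps in depends_on.items():
            if current in deps and service not in affected:
                affected.add(service)
                frontier.append(service)
    affected.discard(target)  # guard against circular dependencies
    return affected


# Rolling back "auth" could affect everything built on top of it.
print(sorted(affected_services("auth", DEPENDS_ON)))
# prints ['catalog', 'checkout', 'payments']
```

The owners of each service in the result are the stakeholders to notify before the rollback goes ahead.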
As open-source fans like to say, “given enough eyeballs, all bugs are shallow.” You could say the same about incident response and generic mitigations.
Generic mitigations can be a powerful tool, but only when wielded in an efficient, consistent manner. Playbooks, policies, and strong collaboration processes help teams avoid the pitfalls that can arise when executing generic mitigations.