5 Tips for a Faster Incident Response Process

If you work in IT Ops, SRE, or DevOps, you don’t need to be told that every second counts in incident response. You already know that.

The challenge for most incident response teams, however, lies in figuring out how to actually improve incident response speed. Beyond obvious, basic steps – such as taking advantage of automation tools for alerting and monitoring, and doing effective post-mortems – strategies for making the incident response process faster can be elusive.

With that reality in mind, here’s a look at five practical (and not necessarily obvious) ways to improve incident response speed.

1. Automate Incident Analysis (in Addition to Response)

As noted above, most teams already know that they can streamline the incident response process by automating the monitoring and alerting process. That’s a basic best practice.

But you can take automation a step further by using tools that automatically interpret, contextualize, and prioritize each alert. This level of automation can save a tremendous amount of time in incident response because it eliminates the need for your team to look up contextual information, such as which connections your application had open when the incident occurred, or what the host environment looks like.

It also helps to prevent your team from becoming distracted by minor incidents (like a non-critical performance issue on a dev/test server) when there are more important ones (like a critical failure in a production application) to respond to. It leads to faster correlation of multiple incidents that result from the same root cause, too, so that engineers waste less time working out the relationships between different incidents before they can respond.
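As a rough illustration of the prioritization and correlation described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Alert` fields, the weight tables, and the idea that alerts sharing a root cause carry a common `fingerprint` value. Real incident response platforms use their own data models and far richer correlation logic.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights; tune these to match your own environments.
SEVERITY_WEIGHT = {"critical": 3, "warning": 2, "info": 1}
ENVIRONMENT_WEIGHT = {"production": 3, "staging": 2, "dev": 1}


@dataclass(frozen=True)
class Alert:
    alert_id: str
    service: str
    environment: str
    severity: str
    fingerprint: str  # assumed: shared by alerts with the same suspected cause


def priority(alert: Alert) -> int:
    """Score an alert so that critical production issues rank highest."""
    severity = SEVERITY_WEIGHT.get(alert.severity, 1)
    environment = ENVIRONMENT_WEIGHT.get(alert.environment, 1)
    return severity * environment


def triage(alerts: list[Alert]) -> list[list[Alert]]:
    """Group alerts by fingerprint, then order groups by their top priority."""
    groups: dict[str, list[Alert]] = defaultdict(list)
    for alert in alerts:
        groups[alert.fingerprint].append(alert)
    return sorted(
        groups.values(),
        key=lambda group: max(priority(a) for a in group),
        reverse=True,
    )
```

Given a dev/test performance warning and two production alerts that share a fingerprint, `triage` would return the production group first, as a single unit, so responders see one incident instead of two unrelated alerts.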

2. Concentrate on Root Causes (Instead of Putting Out Fires)

Speaking of root causes, identifying and resolving root-cause problems should be the core focus of your incident response strategy.

You probably know that already. But the reality of incident response is that focusing on root causes can be difficult when you’re constantly being distracted by other fires during your day-to-day operations. When you’re under pressure to get things back up and running as fast as you can, fixing surface-level issues so that things start working again – even if only temporarily – is often more tempting than taking the time to address the underlying problems. You figure you’ll work on the root cause when things slow down and you have more time, but that time never comes amidst the constant crises.

It’s important to resist this temptation and focus on root causes, even if it means letting some fires burn longer than you would like. In the long run, prioritizing root causes will lead to faster incident response overall. The reason why is obvious: by addressing root-cause issues, you will reduce the number of overall incidents – not to mention the effort required to analyze relationships between incidents in order to determine which ones result from the same root cause. With fewer issues to manage, you can respond to incidents faster.

3. Predefine Incident Response Playbooks

Incident response playbooks that spell out how to respond to incidents of different types are another way to speed up your response process.

Granted, it’s impossible to write playbooks that anticipate every possible incident response scenario with complete accuracy. Some teams are dismissive of playbooks for that reason. But the reality is that, even if your playbooks don’t match the incident you’re facing exactly, they can still save a lot of time because they eliminate the need to plan your response from scratch.

For example, imagine that you experience a failure on a server that is running one flavor of Linux (Ubuntu, let’s say), but your playbook for Linux server failures was written for a different flavor of Linux (like Red Hat Enterprise Linux). Your team may need to adjust its tooling a bit when performing the analysis and response, but the core processes for getting the server up and running will probably be about the same: you’ll check things like memory allocation, disk space, kernel logs, and so on. Having a playbook that guides the overall process, even if some steps require ad hoc adjustment, will lead to a faster resolution than having no playbook at all.
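To make the fallback idea concrete, here is a small Python sketch of a playbook lookup that returns the closest available playbook when no exact match exists. The `PLAYBOOKS` structure, the key names, and the `find_playbook` function are all hypothetical; the steps come from the Linux example above.

```python
from __future__ import annotations

# Hypothetical playbook store keyed by (incident type, platform).
PLAYBOOKS: dict[tuple[str, str], list[str]] = {
    ("linux-server-failure", "rhel"): [
        "Check memory allocation",
        "Check disk space",
        "Review kernel logs",
    ],
}


def find_playbook(incident_type: str, platform: str) -> tuple[list[str], bool]:
    """Return (steps, exact_match).

    Falls back to any playbook written for the same incident type when
    there is none for the exact platform.
    """
    exact = PLAYBOOKS.get((incident_type, platform))
    if exact is not None:
        return exact, True
    for (known_type, _known_platform), steps in PLAYBOOKS.items():
        if known_type == incident_type:
            return steps, False
    return [], False
```

Calling `find_playbook("linux-server-failure", "ubuntu")` would return the Red Hat steps with `exact_match` set to `False`, which signals to the responder that some steps may need ad hoc adjustment.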

4. Define Incident Response Roles

The default approach to incident response at most organizations is to designate on-call engineers for each shift. The on-call engineers are responsible for initiating an incident response. If they can’t handle an incident on their own, it’s up to them to determine who else to involve.

That’s a simple strategy, but it rarely leads to the fastest response times. A better approach is to align incident response roles with different types of incidents. A front-end developer could be assigned to handle incidents that involve app front-ends, for example, while database engineers are responsible for database-related failures. (You can streamline this process further by using an incident response platform that can categorize incidents and align them with different roles automatically.)

This way, instead of relying on a manual process for bringing the person with the right expertise into each response, you can assign the incident to someone who is likely to know how to resolve it out of the gate.
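The routing described above could be sketched in Python roughly as follows. The keyword lists, category names, and role names are invented for illustration, and simple keyword matching stands in for whatever categorization an incident response platform would actually perform.

```python
# Hypothetical keywords used to guess an incident's category.
CATEGORY_KEYWORDS = {
    "frontend": ("css", "javascript", "page render"),
    "database": ("sql", "replication", "deadlock"),
}

# Hypothetical mapping from category to the role that should respond.
ROLE_BY_CATEGORY = {
    "frontend": "front-end developer",
    "database": "database engineer",
}

# Incidents that match no category still go to the on-call engineer.
DEFAULT_ROLE = "on-call engineer"


def categorize(summary: str) -> str:
    """Guess a category from keywords in the incident summary."""
    text = summary.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def route(summary: str) -> str:
    """Return the role that should own the incident."""
    return ROLE_BY_CATEGORY.get(categorize(summary), DEFAULT_ROLE)
```

Keeping the on-call engineer as the default means uncategorized incidents still have an owner, while recognized ones skip the manual handoff.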

5. Communicate in True Real-Time

A final best practice for improving incident response speed is to communicate with your team in real time.

The value of real-time communications is probably something you already recognize. But the mistake many teams make is that they use communication strategies that are close to real-time, but not quite there. For example, they might use email or their ticketing system to coordinate their responses. That works if you can afford to wait for each team member to check for new messages.

But it doesn’t get you to true real-time conversations. Nor does it allow you to share information automatically whenever the status changes.

A better approach is to rely on true real-time communication platforms like Slack. It’s just as important to integrate automated notifications into them. That way, when a server reboots, for example, or a new application version is deployed, your team will automatically and instantly receive a notification.
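One way to wire up such notifications is through a Slack incoming webhook, which accepts a JSON payload with a `text` field. The Python sketch below uses only the standard library. The function names and the message format are assumptions made for this example, and the webhook URL is something you would generate in your own Slack workspace.

```python
import json
import urllib.request


def format_event(event_type: str, target: str, detail: str = "") -> str:
    """Build a short, scannable message for a status change."""
    message = f"[{event_type}] {target}"
    return f"{message}: {detail}" if detail else message


def build_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Prepare the POST request that an incoming webhook expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url: str, text: str) -> int:
    """Send the message and return the HTTP status code."""
    request = build_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A deployment script or reboot hook could then call `notify(webhook_url, format_event("deploy", "checkout-service", "v2.3.1"))`, where the service name and version are placeholders, so the channel hears about the change without anyone having to post it by hand.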

To make incident response faster and better, think beyond the low-hanging fruit of alerting automation or post-mortems. Embrace additional strategies as well, such as predefined playbooks, intelligent assignment of incident response roles, and automated incident analysis.
