Traditional vs. Modern Incident Management in 2020

Cloud has given way to cloud-native computing over the past decade. The sweeping changes at every layer of the application stack have had a bearing on incident management. In this post we compare traditional incident management with modern incident management, and highlight what's changed. We also look at ways DevOps teams can evolve with these changes.

 

 

| | Traditional Incident Management | Modern Incident Management |
| --- | --- | --- |
| Infrastructure & app architecture | Server or VM-based, monolithic | VM or container-based, microservices |
| Team structure | Single-function, bureaucratic, large | Independent, cross-function, polyglot, pizza-sized |
| Deployments | Imperative, clearly-defined steps | Declarative, GitOps, environments as code |
| Who owns reliability | IT | Shared between all Dev & DevOps team members |
| Primary contact | IT / SRE | On-call engineer / SRE |
| Incident detection | Minutes to hours | Seconds to minutes |
| MTTR | Hours to days | Minutes to hours |
| Tooling type | Custom-built, or on-premise | Cloud-based, or open-source |
| Collaboration tools | Ticket-centric, lots of files shared | ChatOps |
| Postmortems | Blame shifting & blame pinning | Solution-oriented |
| Attitude towards failure | Avoid at all costs | Fail fast, fail forward |

Incident management follows the broader trends in the DevOps world. Fundamental changes such as microservices have a bearing on how incident management is implemented. 

Accept the Distributed and Sometimes Fragmented Environment

Modern incident management needs to take into account the distributed nature of infrastructure, applications, and teams. Microservices running in containers that span multiple public cloud platforms are the new normal. Teams are independent, each with its own preferred programming language, open-source and DevOps tools, and processes.

Deployments are declarative, inching closer to the GitOps model that treats environments as code. Automation at every step lets changes be triggered in real time. The minutiae are abstracted away: developers can run apps without worrying about the infrastructure, and Ops teams can run infrastructure without manually configuring memory, networking, and storage. With all this power, when things do go wrong, they can go badly wrong.
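To make the declarative idea concrete, here is a minimal sketch of the compare-and-reconcile step that sits at the heart of this model. The service names, replica counts, and the `plan_changes` function are hypothetical and only illustrate the pattern; real GitOps tooling does considerably more.

```python
from typing import Dict, List, Tuple

# An action is (verb, service name, target replica count).
Action = Tuple[str, str, int]


def plan_changes(desired: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    """Work out which actions move the running state to the declared state."""
    actions: List[Action] = []
    # Create or resize anything the declaration asks for.
    for name, replicas in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, replicas))
        elif actual[name] != replicas:
            actions.append(("scale", name, replicas))
    # Remove anything that is running but no longer declared.
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name, 0))
    return actions


# Hypothetical declared state (as it might live in Git) and observed state.
desired = {"checkout": 3, "search": 2}
actual = {"checkout": 1, "legacy-cart": 1}

for action in plan_changes(desired, actual):
    print(action)
# ('scale', 'checkout', 3)
# ('create', 'search', 2)
# ('delete', 'legacy-cart', 0)
```

The point of the sketch is that nobody writes the steps by hand: the steps are derived from the difference between what is declared and what is running, which is also why a bad declaration can spread quickly.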

This means that both the attack surface and the likelihood of failure grow quickly as services, clouds, and teams multiply. As teams try to keep up with this complexity and pace of change, incident management becomes fragmented. No one can keep up with every incident manually. It takes an automated incident management solution that spans every cloud, every microservice, and every team.

Embrace an Open Toolbox

The arsenal of tools in the cloud has changed drastically over the past decade. Organizations are unwilling to cede control to the biggest vendors like AWS and VMware; instead, they want the freedom to move their workloads to whichever cloud suits their needs best. Kubernetes has become the de facto 'operating system' of the cloud. It is augmented by a suite of open source cloud-native tooling such as Istio, Jaeger, and Helm.

Monitoring has similarly gone from a single tool to an integration of multiple open source tools. The ELK stack, Fluentd, and Prometheus are prominent solutions. There are also cloud-based vendor tools for log analysis, SIEM, and APM. Like the infrastructure they monitor, these tools cannot stand alone; they need to be integrated via APIs.
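One common way to integrate several tools is to translate each tool's alert payload into a single shared shape before anything else acts on it. The sketch below assumes two made-up payload formats, one from a metrics tool and one from a log tool; the field names and the `normalize` function are illustrative and do not match any specific product's API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Alert:
    """A common alert shape that downstream steps can rely on."""
    source: str
    service: str
    severity: str
    summary: str


def from_metrics_tool(payload: Dict[str, Any]) -> Alert:
    # Hypothetical shape: labels and annotations as nested dictionaries.
    return Alert(
        source="metrics",
        service=payload["labels"]["service"],
        severity=payload["labels"].get("severity", "info"),
        summary=payload["annotations"]["summary"],
    )


def from_log_tool(payload: Dict[str, Any]) -> Alert:
    # Hypothetical shape: flat fields with a log level instead of a severity.
    level_to_severity = {"ERROR": "critical", "WARN": "warning"}
    return Alert(
        source="logs",
        service=payload["app"],
        severity=level_to_severity.get(payload["level"], "info"),
        summary=payload["message"],
    )


# One adapter per tool; adding a tool means adding one entry here.
NORMALIZERS: Dict[str, Callable[[Dict[str, Any]], Alert]] = {
    "metrics": from_metrics_tool,
    "logs": from_log_tool,
}


def normalize(source: str, payload: Dict[str, Any]) -> Alert:
    return NORMALIZERS[source](payload)


alert = normalize("logs", {"app": "checkout", "level": "ERROR", "message": "Payment timeout"})
print(alert.service, alert.severity)  # checkout critical
```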

This monitoring toolchain is incomplete without an incident management solution. An incident management solution completes the monitoring loop and helps make the leap from insight to action. 

Align with the Modern Culture

Culture is central to the DevOps methodology. The best architecture and tooling will fail without a change in the way teams are structured: the common language they speak, the expectations placed on them, and the spirit of collaboration. This is especially true now that many DevOps teams work remotely, where communication can easily fall through the cracks.

The role of the on-call engineer is vital to the practice of incident management. In most cases, this isn't a person hired for this specific role; rather, the entire DevOps team takes turns on call. For this setup to work well, it requires careful, up-to-date documentation and training in the form of runbooks. These are clear step-by-step instructions that tell the on-call engineer what to do in an emergency. Communication should be clear and straightforward so that the on-call engineer isn't caught unprepared.
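Taking turns can be as simple as a fixed rotation computed from the calendar. The following is a minimal sketch, assuming week-long shifts and a hypothetical three-person team; real schedules also need overrides for holidays and swaps, which are left out here.

```python
from datetime import date
from typing import List


def on_call_for(day: date, team: List[str], rotation_start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day` in a simple round-robin rotation."""
    if not team:
        raise ValueError("team must not be empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Count how many full shifts have passed, then wrap around the team list.
    shifts_elapsed = (day - rotation_start).days // shift_days
    return team[shifts_elapsed % len(team)]


# Hypothetical team and start date (a Monday).
team = ["ana", "bo", "chen"]
start = date(2020, 1, 6)

print(on_call_for(date(2020, 1, 6), team, start))   # ana
print(on_call_for(date(2020, 1, 13), team, start))  # bo
print(on_call_for(date(2020, 1, 27), team, start))  # ana
```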

Once an incident is resolved, it is important to avoid the blame game and instead look for lessons so that the same incident isn't repeated. Even if a team member causes a big failure, not penalizing or shaming them sends a strong message that failures are part of progress and need to be managed well. This does wonders to transform the spirit of a DevOps team from apprehension and hesitation to confidence, experimentation, and innovation.

Welcome Proactive Incident Response

Despite the challenges that modern cloud-native computing presents, incident management is only becoming faster and more responsive. With the right setup, incidents can be detected within seconds and resolved in minutes. 
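Claims like "detected within seconds" and "resolved in minutes" are easier to track when they are measured. Here is a small sketch that computes mean time to detect and mean time to resolve from incident timestamps. It assumes both are measured from the moment the incident started; teams define MTTR in different ways, so treat this definition and the sample data as illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mean_delta(deltas: List[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time to detect: start of incident to first detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to resolve: start of incident to resolution (assumed definition)."""
    return mean_delta([i.resolved - i.started for i in incidents])


# Made-up sample incidents.
incidents = [
    Incident(datetime(2020, 5, 1, 10, 0, 0), datetime(2020, 5, 1, 10, 0, 30), datetime(2020, 5, 1, 10, 20, 0)),
    Incident(datetime(2020, 5, 2, 14, 0, 0), datetime(2020, 5, 2, 14, 1, 30), datetime(2020, 5, 2, 14, 40, 0)),
]

print(mttd(incidents))  # 0:01:00
print(mttr(incidents))  # 0:30:00
```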

Slack, with its ChatOps approach to incident resolution, is a key enabler of incident management at this speed. It lets DevOps teams collaborate in a single shared space and ensures everyone has the most up-to-date view of the incident in progress.

StackPulse supercharges this collaboration by enriching and correlating alerts before sending them to Slack, ensuring that DevOps teams have the complete picture of an incident. StackPulse also provides a powerful playbook engine that teams can use to remediate incidents or perform additional maintenance if needed, all from Slack.
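As a rough illustration of what correlating alerts can mean in general, the sketch below groups alerts for the same service that fire within a short window of each other, so a chat channel receives one grouped message instead of many. This is not StackPulse's implementation or API; the `Alert` fields, the `correlate` function, and the five-minute window are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Alert:
    service: str
    summary: str
    fired_at: datetime


def correlate(alerts: List[Alert], window: timedelta = timedelta(minutes=5)) -> List[List[Alert]]:
    """Group alerts per service when each fires within `window` of the previous one."""
    groups: List[List[Alert]] = []
    latest_group: Dict[str, List[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        group = latest_group.get(alert.service)
        if group is not None and alert.fired_at - group[-1].fired_at <= window:
            # Close enough to the previous alert for this service: same group.
            group.append(alert)
        else:
            # First alert for this service, or too late to belong to the old group.
            group = [alert]
            groups.append(group)
            latest_group[alert.service] = group
    return groups


# Made-up alerts for two hypothetical services.
alerts = [
    Alert("checkout", "High error rate", datetime(2020, 5, 1, 10, 0)),
    Alert("checkout", "Payment timeout", datetime(2020, 5, 1, 10, 3)),
    Alert("search", "High latency", datetime(2020, 5, 1, 10, 4)),
    Alert("checkout", "High error rate", datetime(2020, 5, 1, 10, 30)),
]

for group in correlate(alerts):
    print(group[0].service, len(group))
# checkout 2
# search 1
# checkout 1
```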