How to Automate Incident Management with Code and Get Better Results
This post was originally published on The New Stack.
When something goes wrong in your production environment, you want your best and brightest minds to start working on the problem as soon as possible. The best time for me to work on a production problem is when I've got my day planned, I've just finished my morning coffee, and my mind is primed and ready for action. Unfortunately, production incidents seldom occur at this time; usually, it’s more like 3:24 AM. I know from experience that someone who stumbles out of bed and fumbles for their phone and laptop is not going to be in top form for at least ten minutes or so, if you're lucky!
What if you could instead leverage the mental faculties of an engineer whose mind is prepared and focused, building solutions ahead of time for problems that can happen at any hour of the day or night? This article discusses the evolutionary jump from a support model built around the sleepy human to one in which the system itself automatically handles triage and the initial response to an incident. You'll end up with a more resilient system, and your engineers will be able to perform at their best and deliver more value to your customers. It's a win-win situation.
The Next Evolution: Response-As-Code
As software development has evolved, building and supporting applications has become more straightforward and organized. For example, with Infrastructure-as-Code (IaC), you describe infrastructure in machine-readable definition files and then check these files into the code repository alongside your source code, giving you a single source of truth for your application and the infrastructure you need to provide for its deployment.
A response-as-code plan is similar; you check in solutions and tools alongside your code, which can provide the foundation for automatically identifying and resolving problems without the need to involve an engineer. I will show you how to implement this plan below, but first, let me explain why you should consider doing it.
As someone who has supported production systems for many years, I've noticed a couple of things. The DevOps model is excellent for establishing ownership and producing a better product, but when engineers have their hands full both developing new features and supporting systems, the risk of burnout and alert fatigue increases. Implementing a response-as-code plan will take time, and you will need to convince your team that it’s worth it, but the result will be less time spent troubleshooting common problems, as well as a shorter mean time to detection (MTTD) and a shorter mean time to resolution (MTTR).
Get Started with Your Response-As-Code Plan
Your response-as-code plan will have a few critical components. You will begin by building some generic components used to identify and resolve common problems. You’ll also need project-specific components to accomplish the same thing for issues that are specific to each project. Then, you will connect all of these components for a comprehensive system that will automatically handle most problems.
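As a rough sketch of how generic and project-specific components might be connected, the Python below merges two sets of response definitions and lets a project override a generic default. The JSON layout, the `Response` fields, and every alert and action name are assumptions made for illustration; in practice the definitions would live in files checked in alongside your code.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Response:
    """One automated response: which alert it answers and what it does."""
    alert: str
    action: str
    max_attempts: int = 1  # automated attempts before a human is paged


def load_responses(raw: str) -> Dict[str, Response]:
    """Parse a JSON list of response definitions, keyed by alert name."""
    return {entry["alert"]: Response(**entry) for entry in json.loads(raw)}


def build_plan(generic: Dict[str, Response],
               project: Dict[str, Response]) -> Dict[str, Response]:
    """Combine both sets; a project-specific entry replaces a generic one."""
    plan = dict(generic)
    plan.update(project)
    return plan


# Hypothetical definitions shared by every service.
GENERIC = """[
  {"alert": "disk_space_low", "action": "purge_temp_files", "max_attempts": 2},
  {"alert": "traffic_spike", "action": "scale_out"}
]"""

# Hypothetical definitions owned by a single project.
PROJECT = """[
  {"alert": "disk_space_low", "action": "rotate_upload_cache"},
  {"alert": "report_queue_stalled", "action": "restart_report_worker"}
]"""

plan = build_plan(load_responses(GENERIC), load_responses(PROJECT))
for name in sorted(plan):
    print(f"{name}: {plan[name].action}")
```

Keeping the definitions as data rather than logic means a project can change its response to an alert through an ordinary code review.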
Step 1: Begin with Your Playbook
Most teams that I've worked on have put together a compilation of scripts and solutions for specific problems. This compilation is the team's playbook, and it can take many forms, from a shared document to a complex knowledge base. If you don't have a playbook yet, you should gather knowledge from your team members and compile one.
Whatever form your playbook takes, it will enable you to identify some common production problems that your team faces. For each one, begin by determining whether the problem is unique to a single service or is a more generic problem across multiple services. For example, you might occasionally run into disk space issues or sudden spikes in traffic that cause performance degradation. Once you have named the problem and worked out how to detect it programmatically, you can design an automatic response to resolve it. Keep in mind that a programmatic response might not work in every situation, so you need an escalation path that brings in an actual human when the problem breaches a certain threshold.
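To make the detect, respond, and escalate flow concrete, here is a minimal Python sketch using the disk space example. The function names, the `Outcome` type, and the 80% and 95% thresholds are all hypothetical choices for illustration; the cleanup and paging steps are passed in as callables because they depend entirely on your environment.

```python
import shutil
from dataclasses import dataclass
from typing import Callable


def disk_used_fraction(path: str = "/") -> float:
    """Detect: fraction of the disk at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


@dataclass
class Outcome:
    action: str    # "none", "remediated", or "escalated"
    before: float  # reading taken before any remediation
    after: float   # reading taken after remediation (or the same reading)


def respond(check: Callable[[], float],
            remediate: Callable[[], None],
            escalate: Callable[[str], None],
            warn_at: float = 0.80,
            page_at: float = 0.95) -> Outcome:
    """Run one detect -> remediate -> escalate cycle for a single metric."""
    before = check()
    if before < warn_at:
        return Outcome("none", before, before)
    if before >= page_at:
        # Too severe to trust automation: hand off to a human right away.
        escalate(f"usage at {before:.0%}; too high for automated cleanup")
        return Outcome("escalated", before, before)
    remediate()
    after = check()
    if after >= warn_at:
        # The automated fix did not help enough, so a human takes over.
        escalate(f"cleanup ran but usage is still {after:.0%}")
        return Outcome("escalated", before, after)
    return Outcome("remediated", before, after)


# Simulated run: a fake disk at 85% that the cleanup brings down to 40%.
state = {"used": 0.85}
pages = []
outcome = respond(
    check=lambda: state["used"],
    remediate=lambda: state.update(used=0.40),
    escalate=pages.append,
)
print(outcome.action, pages)
```

In a real deployment you would pass `disk_used_fraction` as `check` and wire `escalate` to whatever paging tool your team already uses.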
One thing that I've found invaluable for implementing this step is leveraging your existing monitoring and Application Performance Monitoring (APM) solutions. Many of these products allow you to set up alerts based on specific criteria, and an alert can call an API or a webhook that invokes a script to rectify the problem. In the past, I've had alerts invoke an AWS Lambda function to resolve infrastructure problems automatically.
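The sketch below shows one way a webhook-triggered function might route an alert to a remediation script. It is not the Lambda described above. The `lambda_handler(event, context)` signature is the standard one for Python functions on AWS Lambda, but the payload fields (`alert_name`, `service`), the handler table, and the remediation functions are assumptions; your monitoring product will send its own payload format.

```python
import json
from typing import Any, Callable, Dict


def _response(status: int, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build an HTTP-style response with a JSON body."""
    return {"statusCode": status, "body": json.dumps(payload)}


# Hypothetical remediations; real ones would call your cloud or deploy tooling.
def restart_service(alert: Dict[str, Any]) -> str:
    return f"restarted {alert['service']}"


def expand_volume(alert: Dict[str, Any]) -> str:
    return f"expanded volume for {alert['service']}"


# Maps an assumed alert name to the function that responds to it.
HANDLERS: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "service_unresponsive": restart_service,
    "disk_space_low": expand_volume,
}


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Parse the alert sent by the webhook and dispatch a remediation."""
    try:
        alert = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return _response(400, {"error": "body is not valid JSON"})
    if not isinstance(alert, dict):
        return _response(400, {"error": "body must be a JSON object"})

    handler = HANDLERS.get(alert.get("alert_name"))
    if handler is None:
        # No automated response is defined, so a human needs to look at it.
        return _response(202, {"status": "escalate",
                               "reason": "no automated response defined"})
    try:
        result = handler(alert)
    except Exception as exc:
        # A failed remediation must never be swallowed silently.
        return _response(500, {"status": "escalate", "reason": str(exc)})
    return _response(200, {"status": "handled", "result": result})


event = {"body": json.dumps({"alert_name": "service_unresponsive",
                             "service": "checkout"})}
print(lambda_handler(event, None))
```

Returning an explicit "escalate" status for unknown or failed alerts keeps the human escalation path visible to whatever called the webhook.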
Step 2: Identify and Build Patterns to Detect Problems
Once you've picked off some of the low-hanging fruit by solving common problems based on your playbook, it's time to think bigger. Look across your organization, identify the core technology stack, and then begin compiling a library of code solutions that can automatically detect common problems. You can also reference previous production problems, which will help you identify and resolve the same issues programmatically in the future.
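One possible shape for such a library is a registry of small detector functions that all inspect the same metrics snapshot. The following is only a sketch: the decorator, the metric names, and the thresholds are hypothetical, and real detectors would read from your monitoring system rather than from a hard-coded dictionary.

```python
from typing import Callable, Dict, Optional

Metrics = Dict[str, float]
Detector = Callable[[Metrics], Optional[str]]

# Shared registry that any team can add detectors to.
DETECTORS: Dict[str, Detector] = {}


def detector(name: str) -> Callable[[Detector], Detector]:
    """Decorator that registers a detector function under `name`."""
    def register(func: Detector) -> Detector:
        DETECTORS[name] = func
        return func
    return register


@detector("cache_memory_pressure")
def cache_memory_pressure(metrics: Metrics) -> Optional[str]:
    used = metrics.get("cache_used_bytes")
    limit = metrics.get("cache_limit_bytes")
    if used is None or not limit:
        return None  # not enough data to decide
    ratio = used / limit
    return f"cache memory at {ratio:.0%} of limit" if ratio >= 0.90 else None


@detector("queue_backlog")
def queue_backlog(metrics: Metrics) -> Optional[str]:
    depth = metrics.get("queue_depth", 0)
    return f"queue depth is {depth:.0f}" if depth > 10_000 else None


def run_detectors(metrics: Metrics) -> Dict[str, str]:
    """Run every registered detector and return only those that fired."""
    findings: Dict[str, str] = {}
    for name, func in DETECTORS.items():
        message = func(metrics)
        if message:
            findings[name] = message
    return findings


snapshot = {"cache_used_bytes": 950.0, "cache_limit_bytes": 1000.0,
            "queue_depth": 120.0}
print(run_detectors(snapshot))
```

Each past production incident can then become one more registered detector, so the same problem is caught automatically the next time.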
At this point, it's worth mentioning the work that StackPulse has been doing in this space. In their quest to make the tech world a more reliable place and provide resources for SREs and developers, they’ve already compiled standard playbooks for Redis, RabbitMQ, and other technologies.
Step 3: Build and Share Solutions
Along with your collection of problem identification and troubleshooting tools, you can also begin compiling a library of potential solutions. I mentioned an AWS Lambda function that I built to automatically resolve infrastructure problems under specific conditions. The pattern that I used in that solution could be applied to remediate many issues within AWS, and the logic could be ported over to other cloud and on-premises solutions as well.
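One way to keep remediation logic portable, sketched here under assumptions rather than taken from the Lambda described above, is to write it against a small interface and supply one adapter per platform. The `ComputeProvider` protocol, its method names, and the in-memory adapter are all hypothetical; a real adapter would wrap your cloud or on-premises tooling.

```python
from typing import Dict, Protocol


class ComputeProvider(Protocol):
    """The minimal operations the remediation needs from any platform."""

    def instance_count(self, group: str) -> int: ...

    def set_instance_count(self, group: str, count: int) -> None: ...


def scale_out(provider: ComputeProvider, group: str,
              step: int = 1, max_instances: int = 10) -> int:
    """Add capacity to `group`, never exceeding `max_instances`."""
    current = provider.instance_count(group)
    target = min(current + step, max_instances)
    if target > current:
        provider.set_instance_count(group, target)
    return max(current, target)


class InMemoryProvider:
    """Stand-in adapter used for local testing; it stores counts in a dict."""

    def __init__(self, groups: Dict[str, int]) -> None:
        self._groups = dict(groups)

    def instance_count(self, group: str) -> int:
        return self._groups[group]

    def set_instance_count(self, group: str, count: int) -> None:
        self._groups[group] = count


provider = InMemoryProvider({"web": 3})
print(scale_out(provider, "web", step=2))
```

Because `scale_out` depends only on the protocol, the same remediation can be shared across teams and exercised in tests without touching real infrastructure.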
The greater potential of these first three steps will become more apparent when you begin to share what you've built with others and encourage them to participate. I've yet to meet an engineer who didn't get excited about automating solutions and, more importantly, about reducing the risk of an after-hours phone call to fix a problem.
Step 4: Keep the Ball Rolling and Continue Coding Defensively
Importantly, these steps aren't a one-and-done solution. Implementing your plan will require constant awareness and maintenance as you add new features and technologies. You should strive to build a team and an organizational culture that invests in a robust response-as-code component for all new work moving forward. As I said above, automating responses to potential problems reduces the time it takes to resolve production problems and saves wear and tear on your engineers.
Moving Forward and Improving Continuously
As in the wider DevOps movement, your focus will be on building and establishing strong and resilient patterns for your teams to follow. You should be continuously looking for new ways to improve your process of designing, developing, and deploying software. A robust response-as-code plan will help you move your teams to the next level, and when you've mastered it, you'll be ready for the next iteration of improvements and innovation.
And on the topic of improving continuously, it’s key to be aware of the modern incident response tooling that is becoming more readily available today. You can read more in StackPulse’s article “How the Incident Response Software Stack Has Evolved.”
If this topic interests you, you can sign up for early access to the tools and community that StackPulse is building and learn more about what StackPulse has to offer.