Incident Management and Response: Myth Busting Edition
Site Reliability Engineering, or SRE, is what happens when you ask software engineers to design an operations function. That is how Google VP Ben Treynor Sloss described the background and definition of SRE. The idea of SRE originated at Google. Since then, more and more organizations have adopted SRE practices to maintain their competitive edge in the rapidly evolving world of information technology.
As it’s a relatively new discipline, there is still a lot of misunderstanding and confusion about the role of SRE and what these teams do. I recently wrote an article describing four steps you can take to automate incident management using SRE principles. In this post, we will discuss some of the misconceptions and myths around the SRE role. We’ll also talk about automating incident management and new strategies for responding to incidents.
Let’s get down to some significant myth-busting with the following top five myths!
Myth 1: The Goal of SRE Is 100% Uptime
Achieving 100% uptime is a noble goal, but it’s next to impossible. An adverse side effect of targets like this is that they discourage engineers from experimenting and taking risks. In today’s highly competitive marketplaces, taking risks is necessary to push the envelope and stay ahead of the competition.
The SRE team’s goal is to find a balance between stability and risk, and they do this with a concept known as an error budget. Organizations calculate their error budget by determining the maximum amount of time that a system can fail without contractual consequences. That maximum is dictated by the Service Level Agreement, or SLA, for an application or service. If the SLA for a system is 99.95%, the system may be unavailable for 0.05% of the time, so the annual error budget (over a 365-day year) is 4 hours, 22 minutes, and 48 seconds.
Keeping the error budget in mind, engineering teams focus on reducing incidents and increasing their buffer to take risks and experiment with changes. Ultimately, the error budget helps the engineering team strike a balance between system stability, innovation, and experimentation. The SRE team is pivotal in managing this process and helping other engineering teams achieve the right balance.
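The arithmetic behind the 99.95% example, and the idea of a shrinking buffer, can be sketched in a few lines of Python. This is only a minimal illustration: the function names error_budget and remaining_budget are hypothetical, the 365-day period is an assumption, and the incident durations are made up for the example.

```python
from datetime import timedelta


def error_budget(sla_percent: float, period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the maximum downtime allowed over `period` for a given SLA.

    Assumption: the SLA is expressed as an availability percentage
    (for example 99.95) measured over the whole period.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    allowed_fraction = (100 - sla_percent) / 100
    # Round to whole seconds to avoid floating-point noise in the result.
    return timedelta(seconds=round(period.total_seconds() * allowed_fraction))


def remaining_budget(budget: timedelta, incidents: list[timedelta]) -> timedelta:
    """Subtract downtime already spent on incidents from the budget.

    A negative result means the budget has been exceeded.
    """
    return budget - sum(incidents, timedelta())


annual = error_budget(99.95)
print(annual)  # 4:22:48

# Hypothetical incident durations, purely for illustration.
spent = [timedelta(hours=1), timedelta(minutes=30)]
print(remaining_budget(annual, spent))  # 2:52:48
```

The fewer minutes incidents consume, the more of the budget is left over for risky changes and experiments.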
Myth 2: SREs and the IT Hierarchy
Site reliability engineers divide their time between automating deployment, monitoring, and incident response processes and working alongside development teams to understand and improve those teams’ procedures. Unfortunately, this can lead some software engineers to assume that SREs either take their direction from the development team or are a burden imposed by upper management.
The reality is that SRE is an essential part of DevOps, and the relationship of development to SRE achieves the best results when viewed as a symbiotic one. SRE teams work with developers to determine error budgets, streamline processes, and optimize incident response. By understanding and respecting the unique roles and perspectives of everyone involved in developing, deploying, and supporting software, DevOps engineers and site reliability engineers accelerate the development of features and improve the stability of the production environment.
Myth 3: Playbooks Are Just for Documentation
Many teams have developed Playbooks to guide support personnel in identifying the root cause of an incident and responding to it appropriately. The value of your team’s Playbooks can far exceed that of helpful reference material and troubleshooting documentation. In a previous article, I talked about how your Playbooks can form the foundation of a response-as-code plan.
Response-as-code is the practice of identifying common incidents and then automating both the diagnosis (finding the source of the problem) and the actions that mitigate and resolve it, so that routine incidents no longer depend on your on-call engineer or production support team.
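To make the idea concrete, here is a minimal sketch of what a Playbook expressed as code might look like. Everything in it is an assumption made for illustration: the Playbook class, the respond function, the alert dictionary shape, and the disk-full scenario are hypothetical and do not come from any particular product or library. The remediation step only returns a message; a real one would call your own tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Playbook:
    """A hypothetical playbook: one known incident type and its automated response."""
    name: str
    diagnose: Callable[[dict], bool]   # True if the alert matches this known cause
    remediate: Callable[[dict], str]   # performs the mitigation, returns a summary


def purge_temp_files(alert: dict) -> str:
    # Placeholder for a real mitigation step (assumption: nothing is actually deleted here).
    return f"purged temp files on {alert['host']}"


PLAYBOOKS = [
    Playbook(
        name="disk-full",
        diagnose=lambda alert: alert.get("type") == "disk_full",
        remediate=purge_temp_files,
    ),
]


def respond(alert: dict, playbooks: list[Playbook]) -> str:
    """Run the first matching playbook; escalate to a human if none matches."""
    for playbook in playbooks:
        if playbook.diagnose(alert):
            return f"{playbook.name}: {playbook.remediate(alert)}"
    return "escalate: no playbook matched, paging on-call"


print(respond({"type": "disk_full", "host": "web-1"}, PLAYBOOKS))
print(respond({"type": "unknown"}, PLAYBOOKS))
```

The fallback branch matters as much as the automated one: incidents that no Playbook recognizes still go to a person.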
Myth 4: SRE Work Is the Same Wherever You Go
As I mentioned at the beginning of this article, the concept of SRE originated at Google. On their SRE site, Google offers various resources, books, and a series of workshops in their SRE classroom. Given all that material, one might be forgiven for thinking that a site reliability engineer’s role is clearly defined and the same from one organization to another.
While a site reliability engineer’s principles and techniques are transferable from one organization to the next, the daily tasks, pipelines, and solutions that engineers work with are unique to each one. A site reliability engineer needs to understand the business requirements, the specific technical stack, and the organization’s deployment and observability tools.
Site reliability engineers draw from a solid foundation of technical and soft skills to automate processes and communicate with DevOps engineers and business stakeholders. As we’ve already discussed, these engineers play a pivotal role in achieving a sustainable balance between stability and experimenting with new ideas and features.
Myth 5: SRE Eliminates the Need for Human Intervention
In their SRE book, Google recommends that SREs spend no more than half of their time on operations work, leaving the rest for automation and other engineering projects. The goal of automation might lead you to believe that SRE strives to eliminate the need for human intervention, but that’s not entirely true. Automation should be considered a force multiplier, which reduces the time engineers spend on tedious and repeatable tasks and increases the time they can spend on improving processes, tackling more complex problems, and pursuing innovative ideas.
A primary goal of the SRE team is to eliminate toil. We define toil as any work that is manual, repetitive, devoid of enduring value, and automatable. Reducing toil in your organization can have the effect of improving job satisfaction, accelerating progress, and creating a better, more productive environment for all involved. To put it a different way, it allows your engineers to devote their mental faculties to the parts of your business that will improve your bottom line.
As a new and developing field, SRE is both challenging and exciting for everyone involved. We’ve busted some of the myths out there, but this isn’t an exhaustive list. If you would like to learn more about SRE, Google and other organizations provide excellent resources for prospective SREs and for organizations looking to include SRE in their development processes and strategy.
Another approach you might consider is partnering with an experienced third-party provider of reliability services. StackPulse is one such provider, aiming to offer best-in-class support and resources for the modern engineering enterprise. You’ll also find their blog to be a treasure trove of insightful articles and helpful tips on everything from incident management to tools for your SRE team.