How to Implement a DevOps Culture Using SRE
DevOps and SRE are sometimes described as competing or separate disciplines. This post compares the two and shows that they are complementary rather than competing. We will also explain how to implement DevOps as a culture by using SRE principles and best practices – with a bit of Agile thrown in for good measure.
DevOps vs. SRE?
In the dark ages (not so long ago), Development and Operations were treated as inherently separate disciplines, each with its own specialized roles and functions. Development would throw their code “over the wall,” and Operations would make it run in production. Devs focused on features and had little visibility into how their code ran in production; meanwhile, Ops concentrated on reliability and stability. This focus stemmed (at least in part) from the fact that Ops teams were the ones paged in the middle of the night when something went wrong. Development wanted to move faster, while Operations wanted to move slower. As a result, there were tensions between the two. In many organizations, these tensions persist to this day.
DevOps was born in response to this dichotomy. It seeks to break down traditional organizational barriers by streamlining processes and improving time to delivery.
What Is DevOps?
DevOps is the intersection of two key disciplines: Software Development (Dev) and IT Operations (Ops). It rests on the following key principles (or pillars):
- Reduce organizational silos by breaking down barriers between teams and increasing collaboration.
- Accept failure as normal. Systems are treated as inherently unreliable. Bugs are normal. There is a focus on risk reduction instead of risk avoidance.
- Implement gradual change. Incremental changes allow for easier code reviews, and smaller changes are easier to roll back when (not if) a bug makes it to production.
- Leverage tooling and automation. This may seem like a no-brainer, but operational processes are often partially – if not fully – manual, necessitating specialized teams and roles to perform functions that should be automated. Automation improves reliability and repeatability, and it also enables teams outside of Operations to perform functions that previously required special knowledge or training.
- Measure everything. By doing so, you will know whether your initiatives are working or not. These metrics should be available to everyone in the company to create transparency.
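The gradual-change pillar can be sketched in a few lines. This is a minimal illustration, not a real deployment tool: the function name `gradual_rollout`, the stage percentages, and the `error_rate_at` callback are all hypothetical stand-ins for whatever your deployment system and metrics pipeline provide.

```python
from typing import Callable, Sequence, Tuple


def gradual_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[bool, int]:
    """Widen a release stage by stage and roll back at the first unhealthy stage.

    stages          -- percentages of traffic to shift, in increasing order
    error_rate_at   -- hypothetical callback returning the error rate observed
                       once `percent` of traffic is on the new version
    max_error_rate  -- highest error rate we tolerate before rolling back

    Returns (succeeded, percent of traffic left on the new version).
    """
    reached = 0
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            # Small steps keep the blast radius small and the rollback cheap.
            return False, 0
        reached = percent
    return True, reached


# A healthy release reaches 100% of traffic.
healthy = gradual_rollout([1, 10, 50, 100], lambda percent: 0.001, 0.01)

# A release that degrades under load is rolled back before full exposure.
degraded = gradual_rollout(
    [1, 10, 50, 100],
    lambda percent: 0.05 if percent >= 50 else 0.001,
    0.01,
)
```

In this sketch the degraded release is caught at the 50% stage, so half of the traffic never saw the bad version. That is the practical benefit of shipping in small increments.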
What Is SRE?
Site Reliability Engineering (SRE) is the application of software engineering to operational problems. Site Reliability Engineers have a distinct role in an organization, and they teach application developers how to build reliable services.
SRE seeks to implement DevOps through the following practices:
- SRE is based on the concept of shared ownership. Tools and responsibilities are shared between the developers and the SRE team. This addresses the DevOps pillar of reducing organizational silos.
- SRE seeks to embrace risk through the use of Service Level Indicators and Service Level Objectives as well as blameless postmortems. This is how SRE implements the DevOps pillar of accepting failure as normal.
- SRE seeks to reduce the cost of failures so that developers and product teams can move quickly to deliver new features and applications. This makes gradual change possible.
- A core tenet of SRE is automating menial tasks. This addresses the DevOps pillar of leveraging tooling and automation.
- SRE treats operations as a software problem and prescribes specific ways to measure it. SLIs and SLOs provide feedback to SREs and Devs, and error budgets quantify how much unreliability remains acceptable. This addresses the DevOps pillar of measuring everything.
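To make the relationship between SLIs, SLOs, and error budgets concrete, here is a minimal sketch for a request-based availability SLO. The class name `ErrorBudget`, its fields, and the `can_ship` policy are assumptions made for illustration. Real teams define their own indicators, measurement windows, and rules for what happens when the budget runs out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float      # objective, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int    # requests served in the measurement window
    failed_requests: int   # requests that violated the objective

    @property
    def sli(self) -> float:
        """Measured success ratio: the Service Level Indicator."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """The error budget: failures the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        """Budget left; a negative value means the SLO has been missed."""
        return self.allowed_failures - self.failed_requests

    @property
    def can_ship(self) -> bool:
        """Illustrative policy: risky releases proceed only while budget remains."""
        return self.remaining > 0


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"SLI={budget.sli:.4%} remaining={budget.remaining:.0f} ship={budget.can_ship}")
```

With a 99.9% objective over one million requests, roughly 1,000 failures are tolerated. After 400 failures about 600 remain, so feature work can continue. Once the budget is spent, the same number gives Devs and SREs a shared reason to prioritize reliability.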
One way to picture the relationship is to think of DevOps as a philosophy and SRE as a set of tools and practices that can be used to implement that philosophy.
4 Ways to Implement a DevOps Culture Using SRE
So how can you use SRE to implement a DevOps culture at your company?
1. Start with Education
Unless your engineering organization is tiny (or unified), new initiatives that aim to change the existing work culture will likely be met with resistance. This can be addressed in part by making incremental changes that limit the scope of resistance and enable issues to be addressed faster. Start with educating everyone involved in the development process (including Product, Development, and even Operations teams) about changes to the platform on which the applications are deployed.
Educating employees about the advantages of new methodologies and processes can help quell resistance to adopting them. Teaching development teams how to use automation and tools that are normally reserved for operations teams demystifies these otherwise opaque processes. It also encourages Devs to take on more ownership as they take more responsibility for their applications.
2. Share Ownership
Shared ownership between Operations and Development is another important concept that can be used to reduce organizational silos.
All teams should use the same tooling and have the same permissions to perform tasks. If developers cannot perform certain functions due to a lack of privileges on the application platform, this will create a barrier that requires Devs to rely on Ops to handle certain tasks for them.
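One way to keep this honest is to audit permissions for parity. The sketch below assumes permissions can be exported as a simple mapping from role name to a set of permission strings. The role and permission names are made up for illustration and do not correspond to any particular platform.

```python
from typing import Dict, Set


def permission_gaps(roles: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, for each role, the permissions that another role has but it lacks.

    An empty result means every role can perform the same tasks.
    """
    everything: Set[str] = set().union(*roles.values())
    gaps = {role: everything - granted for role, granted in roles.items()}
    # Report only roles that are actually missing something.
    return {role: missing for role, missing in gaps.items() if missing}


# Hypothetical export of platform permissions.
roles = {
    "dev": {"deploy", "view_logs"},
    "ops": {"deploy", "view_logs", "restart_service"},
}
print(permission_gaps(roles))
```

Here the audit reports that the `dev` role cannot restart a service, which is exactly the kind of gap that forces Devs to wait on Ops.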
This shared ownership should go both ways. Traditionally, Ops responded to outages and incidents because the production environment was almost entirely their purview. Being on-call was part of the Ops role, and it was usually seen as one of the most painful responsibilities of Ops engineers. Anyone who has worked in Ops will have a horror story about receiving a page at 3 AM and having to troubleshoot a crashed application without knowing how the application code worked.
Educating Devs about how the platform environments operate and showing them how their code runs in production will go a long way towards improving system reliability. This will lead to shared responsibility for production outages; Devs and Ops will be paged simultaneously, and they will work together to resolve the issue.
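Simultaneous paging can be expressed as a simple routing rule. The following is a sketch under stated assumptions: the on-call rota is a plain dictionary, and the service and engineer names are hypothetical. A real setup would read this from your paging tool instead.

```python
from typing import Dict, List


def page_targets(service: str, on_call: Dict[str, Dict[str, str]]) -> List[str]:
    """Return everyone to page for an incident on `service`.

    Both the Dev and the Ops on-call engineers are paged together,
    so neither team troubleshoots production alone.
    """
    rota = on_call.get(service, {})
    return [rota[team] for team in ("dev", "ops") if team in rota]


# Hypothetical rota: one Dev and one Ops engineer per service.
on_call = {
    "checkout": {"dev": "alice", "ops": "bob"},
    "search": {"ops": "carol"},  # no Dev on-call yet: a gap worth closing
}
print(page_targets("checkout", on_call))
```

A service that returns only one name, such as `search` in this example, shows where shared ownership has not yet been put in place.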
3. Create Feedback Loops
Many modern organizations use Agile software development methodologies as part of their software development lifecycles. Just as SRE and DevOps are closely related, so are DevOps and Agile. Certain key Agile practices can therefore also help to create a DevOps culture. One such practice is the creation of a feedback loop.
If Devs and Ops teams are using the same tools and have the same privileges, you already have a partially implemented feedback loop. When something doesn’t work as expected, both teams feel the pain. Without that parity, Ops ends up fixing most problems simply because they have special privileges or proprietary tooling that is not available to other teams.
Another key part of a feedback loop is providing a way for Devs to file tickets with Ops when a problem is found or a new feature is requested. Many issue-tracking products can do just that, and the company’s Product and Development teams are probably already using one. Operations should be no exception.
Another way to break down silos and encourage feedback is to have Dev teams contribute to software projects that Ops teams are building. When Dev teams can contribute code to the platform and tooling, they will feel a greater sense of ownership than they would if Ops teams simply threw a new tool over the wall for them to figure out for themselves.
4. Provide Focus Groups, Retrospectives, and Postmortems
Regular feedback sessions should be conducted between Ops and Dev teams. These sessions help to improve communication and encourage shared ownership. Ops should seek to understand the use cases that Devs encounter, which will help make Devs feel that their input is valued. This will also help satisfy another Agile best practice, which is gathering requirements.
You can also leverage focus groups to get meaningful feedback from your teams. For example, when Devs participate in creating the Ops roadmap, they can help guide Ops teams and ensure that the right platform and tooling features are delivered when needed. You’ll see improved cross-team collaboration in developing the application platform along with features that weren’t developed in a vacuum.
You can also leverage retrospectives and postmortems to gain feedback. During a retrospective, teams reflect on what happened during the last release cycle and identify actions for improvement going forward. In the case of postmortems, teams should consider using the “blameless postmortem” as implemented at Google and other companies. Both of these Agile practices encourage ownership and help break down silos. They also help reduce risk by accepting failure as normal and providing a path for improving internal processes in addition to platforms and applications.
We hope that this article has clarified some of the differences between DevOps and SRE and shown you how they can be used together to achieve a DevOps culture within your organization. For more information, you can take a look at Google’s take on SRE as well as this excellent series of videos that they posted on YouTube.