Understanding Service Level Objectives (SLOs)

What is an SLO (Service Level Objective)?

An SLO is an internal objective, defined using key performance indicators (KPIs) that are significant for the operation of a critical system. Operations teams, site reliability engineers (SREs), and developers use SLOs to set goals for key operational parameters of the systems they build and maintain. In some cases, SLOs may be reflected in the service level agreements (SLAs) exposed to end users of a system.

Metrics related to SLOs are shared with stakeholders to demonstrate vendor commitment, ensure quality of service, and encourage accountability. These metrics are called service level indicators (SLIs) and can include the availability of specific application components or services to end users, latency and quality of access for users in various locations, or acceptable timeframes for partial unavailability.

In this post, we’ll cover:

  • Why are SLOs important?
    • Site Reliability Engineers
    • Developers, Operations, DevOps
    • End Users
  • SLIs and SLAs
    • Service Level Indicators (SLIs)
    • Service Level Agreements (SLAs)
  • Implementing and Evaluating SLOs
  • Optimizing SLOs with StackPulse

Why are Service Level Objectives Important?

SLOs are important because they provide clear goals for multiple teams to work towards, and because they help set end user expectations (even though the end user is not exposed to the SLO directly). It is not always easy to agree on the targets that SLOs set, but nevertheless, SLOs can be a powerful tool for ensuring quality products and services with reasonable goals.

As part of defining SLOs, teams commonly use the concept of an "error budget". An error budget reflects the understanding that critical systems can and will fail. Teams should determine a realistic goal for a critical system in terms of the frequency or severity of errors; this is the error budget for that system. They should then put operational practices in place to ensure that the number or frequency of errors does not exceed the error budget.
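
As a concrete illustration, here is a minimal sketch of how an error budget can be derived from an SLO target, assuming a simple time-based availability SLO; the 99.9% target and 30-day window are illustrative, not prescribed values.

```python
# Minimal sketch: deriving a time-based error budget from an availability SLO.
# The 99.9% target and 30-day window are illustrative assumptions.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Return the allowed downtime (in minutes) for the given SLO target and window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)

if __name__ == "__main__":
    budget = error_budget_minutes(slo_target=0.999, window_days=30)
    print(f"A 99.9% SLO over 30 days allows ~{budget:.1f} minutes of downtime")  # ~43.2 minutes
```

With a 99.9% availability target over a 30-day window, the budget works out to roughly 43 minutes of downtime that the team can "spend" before the SLO is breached.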

Site Reliability Engineers

Service level objectives help site reliability engineering teams ensure they are focused on improving the metrics most important to the business. SREs can use SLOs to make data-driven decisions around prioritization, and to understand the costs and risks involved in specific operating decisions.

When used correctly, SLOs make the goals defined by site reliability engineers concrete. This helps teams reach short-term goals while accounting for long-term maintenance.

Developers and Operations Engineers

Depending on your methodology, developers and operations teams may have opposing views and goals when it comes to SLOs. If these teams are not working cooperatively, their responsibilities and workflows may clash. For example, when working in silos, operations teams tend to prioritize the stability of services while developers prioritize additional features.

To resolve this potential conflict, you can use the process of creating SLOs and error budgets to build a shared understanding between teams. If error budgets are mostly unused, developers can freely focus on creating new features and operations can focus on long-term reliability. Once budgets start running low, both teams can shift their focus to collaborating on identifying and correcting stability issues.
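
One common way to operationalize this hand-off is to track how much of the error budget has been consumed in the current window and shift priorities once consumption crosses a threshold. The sketch below assumes illustrative numbers and an 80% threshold; neither is a standard value.

```python
# Minimal sketch: signaling a shift from feature work to reliability work
# once most of the error budget is consumed. Thresholds and figures are
# illustrative assumptions.

def budget_consumed_ratio(downtime_minutes: float, budget_minutes: float) -> float:
    """Fraction of the error budget already spent in the current window."""
    return downtime_minutes / budget_minutes

def should_prioritize_reliability(downtime_minutes: float,
                                  budget_minutes: float,
                                  threshold: float = 0.8) -> bool:
    """Return True once budget consumption crosses the chosen threshold."""
    return budget_consumed_ratio(downtime_minutes, budget_minutes) >= threshold

if __name__ == "__main__":
    # Example: 36 of the ~43 budgeted minutes already consumed this window.
    print(should_prioritize_reliability(downtime_minutes=36.0, budget_minutes=43.2))  # True
```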

End Users

Although most end users would prefer services to be 100% reliable all the time, building infrastructure that comes close to guaranteeing this might not be cost-effective (or might introduce constraints that hurt users in other ways, such as reduced functionality).

You can treat this benchmark as an ideal, but not a realistic goal. Instead, you need to find SLOs that balance providing greater value to your users through innovation with providing a better experience through reliability.

One way to create this balance is to define error budgets that determine how much deviation is acceptable. Another is to make sure you are measuring user experience accurately. For example, you shouldn't rely only on the outcomes of support tickets: positive support cases cannot account for users who never reach out, and they don't guarantee that your service or product is meeting user expectations.

Service Level Indicators and Service Level Agreements

SLIs and SLAs are two concepts that are closely tied to SLOs. One enables you to see how well you are meeting your objective. The other enables you to define a contract with your users for meeting your objectives.

Service Level Indicators

SLIs enable you to measure the quality of your service from the user’s perspective. Ideally, these metrics can be collected directly from the customer or customer device. If this isn’t possible, you can collect metrics data from the closest boundary between the customer and service. For example, if you are unable to incorporate a data collection agent on the client end, you can place one on your load balancer instead. 
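
As an illustrative sketch of collecting SLI data at that boundary, the snippet below derives an availability figure from access-log lines; the log format (a status code and a latency value per line) is a simplified assumption, not any particular load balancer's format.

```python
# Minimal sketch: deriving an availability SLI from access logs collected at
# the load balancer. The log format (status code, then latency in ms) is a
# simplified assumption.

def parse_log_line(line: str) -> tuple[bool, float]:
    """Return (request_succeeded, latency_ms) for one hypothetical log line."""
    status_code, latency_ms = line.split()
    return int(status_code) < 500, float(latency_ms)  # 5xx counts as a failed request

def availability(lines: list[str]) -> float:
    """Fraction of requests that did not fail with a server-side error."""
    results = [parse_log_line(line)[0] for line in lines]
    return sum(results) / len(results) if results else 1.0

if __name__ == "__main__":
    sample = ["200 123.4", "200 87.1", "503 30.2", "200 110.0"]
    print(f"Availability SLI: {availability(sample):.3f}")  # 0.750
```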

Determining your SLIs is an iterative process that relies on your business requirements and available measurements. Typically, you start by defining possible SLOs and try to identify which SLIs are relevant. If measurements aren't possible, you may need to rework your SLOs. Likewise, if valuable SLI data is available that your SLOs don't incorporate, you may want to create SLOs that leverage that data.

When considering which SLIs are available, or which you want to make available, keep the following categories and measurement methods in mind.

  • Application-level metrics: You can expose metrics internally according to what data is being processed or which requests are being served. The caveat of using back-end application-level metrics for SLIs is that you don't see the picture from the point of view of the service's users, who may have requests that never even reach the back end.
  • Server-side logs: Processing server-side logs of requests or processed data to generate SLI metrics. The same concern as above applies; these miss the point of view of end users.
  • Front-end infrastructure metrics: Using metrics from load-balancing infrastructure, such as a layer 7 load balancer, to measure SLIs.
  • Synthetic clients or data: Building a client that sends fabricated requests at regular intervals and validates the responses (a minimal probe sketch follows this list). While this provides a more realistic picture than back-end or even front-end infrastructure metrics, it still isn't an indicator of the experience of the service's actual users.
  • Data-processing pipelines: Creating synthetic known-good input data and validating the output. The same concern as above applies here as well; this is not a true measure of end user experience.
  • Client-side instrumentation: Adding observability features to the client the customer interacts with, and logging events back to your serving infrastructure to track SLIs. While this is the most "expensive" method of collecting telemetry on the service experience, it is the only one that provides actual data on the experience of the service's users.
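
To make the synthetic-client approach more concrete, here is a minimal probe sketch: it periodically sends a request, validates the response, and records success and latency. The URL, interval, and expected status code are illustrative assumptions.

```python
# Minimal sketch of a synthetic client (black-box probe): it sends a request
# at a regular interval, validates the response, and records success and latency.
# The URL, interval, and expected status are illustrative assumptions.
import time
import urllib.request

PROBE_URL = "https://example.com/healthz"  # hypothetical endpoint
INTERVAL_SECONDS = 60

def probe(url: str) -> tuple[bool, float]:
    """Send one synthetic request; return (succeeded, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            ok = response.status == 200
    except Exception:
        ok = False
    return ok, time.monotonic() - start

if __name__ == "__main__":
    while True:
        ok, latency = probe(PROBE_URL)
        # In practice these results would be shipped to your metrics pipeline.
        print(f"synthetic probe: ok={ok} latency={latency:.3f}s")
        time.sleep(INTERVAL_SECONDS)
```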

Service Level Agreements

SLAs are contracts between users and providers that guarantee a specific level of service. From the provider's side, these agreements make it easier to manage customer expectations. They also make it possible to define conditions under which users receive compensation in case of an outage or service degradation.

From the customer's side, SLAs provide a measure against which they can compare services. They also provide some protection against provider negligence and enable customers to hold providers accountable for poor service. SLAs are how the end user is exposed to the objectives that internal teams have defined as SLOs.

When agreed upon, SLAs are typically provided alongside or as part of the master service agreement. These documents are included to more clearly specify what a provider is responsible for, how responsibility is measured, and what the terms are for both parties.

Implementing and Evaluating Service Level Indicators

Defining SLIs is often a matter of selecting metrics from your performance management system. However, in some cases, existing performance management systems are focused on metrics that do not translate directly to customer-facing outcomes. 

Typically, performance management systems are set up to produce only a few essential performance metrics – having too many SLIs can be distracting and prevent teams from paying attention to the critical indicators. The metrics used should be those most crucial for the infrastructure or components of your system.

For transaction services, the following SLIs are common (a short calculation sketch follows the list):

  • Availability: Was the system able to respond to the request?
  • Latency: How long did it take to respond?
  • Throughput: How many requests was the system able to handle?
  • Error rate: What proportion of the requests handled failed versus succeeded?
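
As a concrete sketch of how these SLIs can be computed, the snippet below aggregates a batch of request records; the records and the one-minute window are illustrative assumptions.

```python
# Minimal sketch: computing common transaction SLIs from raw request records.
# Each record is (succeeded, latency_ms); the data and window are illustrative.

records = [(True, 120.0), (True, 95.0), (False, 3000.0), (True, 180.0), (True, 110.0)]
window_seconds = 60  # assumed measurement window

total = len(records)
successes = sum(1 for ok, _ in records if ok)
latencies = sorted(latency for _, latency in records)

availability = successes / total        # fraction of requests served successfully
error_rate = 1 - availability           # fraction of requests that failed
throughput = total / window_seconds     # requests handled per second
p95_latency = latencies[min(total - 1, round(0.95 * (total - 1)))]  # 95th percentile latency (ms)

print(f"availability={availability:.1%} error_rate={error_rate:.1%} "
      f"throughput={throughput:.2f} req/s p95_latency={p95_latency:.0f}ms")
```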

For storage services, for example, common SLIs are durability, which measures the likelihood that data will be retained over a long period, as well as bandwidth or latency for access to stored data.

You might need to combine metrics and perform additional calculations to transform infrastructure metrics into the few service metrics you need. From an operational point of view, it's worth documenting how key performance metrics are transformed into SLIs. Doing so provides a reference back to the source of the data for troubleshooting purposes.

Service Level Objectives with StackPulse

When working within a framework consisting of well-defined Service Level Objectives and Service Level Indicators, engineering organizations are faced with two main challenges:

  • Managing incident response processes in order to achieve a consistent flow of operations around the service.
  • Improving the efficiency of incident response processes to shorten MTTD (mean time to detection) and MTTR (mean time to resolution) and to lengthen MTBF (mean time between failures), preserving more of the error budget and making it easier for the organization to meet its SLOs (a calculation sketch for these measures follows this list).
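
As a rough sketch of how these measures are computed (the incident timestamps below are illustrative assumptions), MTTD, MTTR, and MTBF can be derived from incident records as simple averages:

```python
# Minimal sketch: computing MTTD, MTTR, and MTBF from incident records.
# Each incident carries (started, detected, resolved) timestamps; the data is illustrative,
# and MTTR is measured here from failure start (some teams measure it from detection).
from datetime import datetime

incidents = [
    (datetime(2021, 3, 1, 10, 0), datetime(2021, 3, 1, 10, 6), datetime(2021, 3, 1, 10, 40)),
    (datetime(2021, 3, 9, 2, 15), datetime(2021, 3, 9, 2, 19), datetime(2021, 3, 9, 3, 5)),
    (datetime(2021, 3, 20, 17, 30), datetime(2021, 3, 20, 17, 38), datetime(2021, 3, 20, 18, 2)),
]

def mean_minutes(deltas) -> float:
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes([detected - started for started, detected, _ in incidents])
mttr = mean_minutes([resolved - started for started, _, resolved in incidents])
# MTBF: average gap between the end of one incident and the start of the next.
mtbf = mean_minutes([incidents[i + 1][0] - incidents[i][2] for i in range(len(incidents) - 1)])

print(f"MTTD={mttd:.1f} min  MTTR={mttr:.1f} min  MTBF={mtbf / 60:.1f} hours")
```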

The StackPulse Reliability platform can help with both of these challenges by offering automation, management, and analytics for incident response processes.

Whether an incident is identified automatically by a monitoring system or, unfortunately, reported by the service users, StackPulse can immediately execute automated workflows that will perform the following actions:

  • Verifying the extent of the problem and its impact on system components: is it a mission-critical problem? What should its priority be? Is it a false positive, or a reflection of another issue that is already being handled?
  • Checking potential causes for the problem, either disqualifying them or identifying the direction for the root cause.
  • Updating relevant persons and systems with the latest state of the potential incident.
  • Auditing and documenting investigation and remediation steps, in order to enable efficient post-incident analysis.
  • Automatically applying remediations and monitoring their effect on the system to achieve self-healing.

With StackPulse, a service-owning organization can codify and analyze its operational practices, ensuring that SLOs are met consistently across all service components. 

StackPulse allows service organizations to keep delivering uncompromising reliability for their services as demand grows, while reducing toil, alert fatigue, and context switching for engineering teams.