A Guide to DevOps Monitoring
DevOps monitoring is the practice of implementing processes and tools to provide visibility into the DevOps pipeline. Typically, monitoring implementations collect metrics that are used for alerting, notifications, and logging. In DevOps, there are four important areas you should monitor: latency, traffic, errors, and saturation.
You can set up DevOps monitoring by using a variety of metrics, including host-based metrics, application metrics, network and connectivity metrics, server pool metrics, and external dependency metrics. When configuring monitoring, you should also take into account available resources, component functions, deployment environments, and metric utility.
In this article, you will learn:
- An introduction to DevOps monitoring
- What are metrics, and why do you need to collect them?
- The four golden signals of DevOps monitoring
- What type of information is important to track?
- Factors that affect what you choose to monitor
An Introduction to DevOps Monitoring: Metrics, Logs, Traces, Notifications, Alerts / Pages
To understand monitoring in DevOps and to implement it effectively, you need to understand the basics first. Monitoring is the process of collecting, correlating, and processing data about systems or operations. In this process, quantitative, real-time (or near-real-time) data is collected for predefined events, such as query processing times, request rates, bandwidth use, or data access. In DevOps, continuous monitoring is essential because most other processes are also continuous.
To accomplish monitoring, teams implement the following:
- Metrics – sets of quantitative data points describing the most important operational parameters of an infrastructure or application component. Metrics data is processed in real-time or very close to it.
- Logs – records describing important events that take place in the system (a record, or log entry, is issued for each event). Log data usually cannot be processed in real-time due to delays in collecting and transferring the logs; modern logging infrastructure aims to process logs close to real-time.
- Traces – very verbose records describing every step or operation performed by each component of the system. They can be generated either by meticulously adding trace-emitting functionality inside the software code, or by instrumenting code that does not contain traces and injecting them. Because traces are so verbose, issuing them carries high computational and storage costs, so in many cases traces are enabled on-demand or issued only for specific periods of time.
- Alerts—issued by specified tools based on metrics or logs data when the data reflects a problem in the underlying system. Without alerts, metrics, logs, and traces data can only be accessed on-demand. Alerting tools are used to prompt manual action, as well as to ensure that teams are aware when a system is automatically responding to a potential incident. Alerts are typically sent for issues related to insufficient resources, security, failed processes, or failure of system resources.
- Notifications—notifications are like alerts but are used to inform teams and users about standard or expected events. For example, when builds pass testing, when updates are complete, or when analyses are ready. Notifications can also be used to distribute information to users. For example, explaining why passwords were reset or why assets are temporarily unavailable.
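To make the relationship between metrics, logs, and alerts concrete, here is a minimal Python sketch. The metric names, the in-memory store, and the 5% error-rate alert threshold are all assumptions for illustration; a real system ships metrics to a time-series database and logs to a log pipeline.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payments")

# In-memory metric store; a real system would ship these to a TSDB.
metrics = {"requests_total": 0, "errors_total": 0}

ERROR_RATE_ALERT_THRESHOLD = 0.05  # assumed alerting policy: >5% errors

def record_request(ok: bool) -> None:
    """Update metrics and emit a log record for one handled request."""
    metrics["requests_total"] += 1
    if not ok:
        metrics["errors_total"] += 1
        log.warning("request failed")  # log: a record of one event

def check_alerts() -> list:
    """Evaluate alert rules against current metric values."""
    alerts = []
    if metrics["requests_total"]:
        rate = metrics["errors_total"] / metrics["requests_total"]
        if rate > ERROR_RATE_ALERT_THRESHOLD:
            alerts.append(f"error rate {rate:.1%} above threshold")
    return alerts

for ok in [True, True, True, False]:
    record_request(ok)
print(check_alerts())  # → ['error rate 25.0% above threshold']
```

Note the division of labor: the metric counts events cheaply, the log records each failure in detail, and the alert fires only when the metric crosses a policy threshold.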
What Are Metrics and Why Do We Collect Them?
Metrics are often the primary outputs of monitoring. These values are either raw or contextual measurements that teams can use to assess performance, predict capacity, and optimize processes. In DevOps pipelines, you can use metrics for a variety of reasons, including to gauge team productivity, measure application quality, and evaluate component operations.
DevOps metrics are most useful when used in combination to present an overall view of your systems and processes. By combining these measures, you can begin to derive insights into the true compatibility of your various components. You can also use metrics as a litmus test for changes you enact.
The Four Golden Signals of DevOps Monitoring
The Four Golden Signals of Monitoring model suggests the primary issues you should focus on. Collecting information on other events is also valuable, but if you can only watch four signals, these provide the greatest value.
- Latency—how long a process takes. When collecting latency data, make sure you are distinguishing between successful and failed requests or attempts. If you look at latency as a whole, fast error responses can be mistaken for low latency overall.
- Traffic—the demand on your servers, applications, or network. This typically includes the number of requests, concurrent sessions, throughput, or I/O rate. Ideally, this should be contrasted with your saturation.
- Errors—the rate of failed processes or requests. This can be based on system-determined failures, such as timeouts, or on policy failures. For example, serving the wrong content in response to a webpage request. Depending on the type of error you’re trying to monitor, you may need to implement end-to-end tests rather than relying on event statuses, such as protocol response codes.
- Saturation—the amount of available capacity being used in your system or components. This measure can provide context for all other measures and can help you identify bottlenecks. It can also provide an indicator for when you need to scale up your resources or put limiters in to prevent system overload.
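All four signals can be derived from the same stream of request records. A minimal Python sketch, using synthetic request samples and an assumed capacity figure; note how latency is computed over successful requests only, as the latency bullet above advises:

```python
import statistics

# Synthetic request samples: (duration_seconds, succeeded)
requests = [(0.12, True), (0.30, True), (0.02, False), (0.25, True), (0.01, False)]

ok_latencies = [d for d, ok in requests if ok]
traffic = len(requests)                          # demand: requests in the window
errors = sum(1 for _, ok in requests if not ok)  # failed requests
error_rate = errors / traffic
# Latency of successful requests only: the two fast failures (10-20 ms)
# would otherwise drag the figure down and mask real latency.
median_latency = statistics.median(ok_latencies)

capacity = 100          # assumed capacity: requests the window can absorb
saturation = traffic / capacity

print(f"traffic={traffic} error_rate={error_rate:.0%} "
      f"median_latency={median_latency:.2f}s saturation={saturation:.0%}")
# → traffic=5 error_rate=40% median_latency=0.25s saturation=5%
```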
What Type of Information Is Important to Track?
When monitoring metrics, you need to narrow down which measures to track from those available to you. The specific metrics you can use will depend on your tooling and any custom data analyses you create.
Host-based metrics
Host-based metrics inform you about the performance and health of individual environments or machines. These measures can help you understand what infrastructure factors may affect your hosts.
Standard host-based metrics include CPU usage, disk space, memory, and the number of running processes.
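A minimal sketch of collecting a few of these host-based metrics using only the Python standard library. The metric names are illustrative; production monitoring agents collect far more, and far more frequently:

```python
import os
import shutil

def host_metrics(path: str = "/") -> dict:
    """Collect a few basic host-based metrics from the local machine."""
    total, used, free = shutil.disk_usage(path)
    m = {
        "cpu_count": os.cpu_count(),
        "disk_used_pct": round(100 * used / total, 1),
    }
    # Load average is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        m["load_1m"], m["load_5m"], m["load_15m"] = os.getloadavg()
    return m

print(host_metrics())
```

In practice an agent would sample these values on an interval and ship them to the monitoring backend rather than printing them.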
When using virtualized environments and running jobs inside containers, host-based metrics are not sufficient; similar metrics should be collected for each container running on a host. In fact, when using managed container orchestration or serverless hosting environments, such as Amazon Elastic Kubernetes Service (EKS) or Google Kubernetes Engine (GKE), the focus should be on container metrics, and host-based metrics are replaced with “applicative” metrics about the server pool.
Application metrics
Application metrics inform you about services or processes that rely on hosts. These measures vary according to application dependencies, functionality, and connectivity to other components. Application metrics can indicate the performance, load, or health of your applications.
Standard application metrics include resource use, latency, service failures or restarts, request response rates, request/response sizes, and so on.
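One common way to collect application metrics is to instrument service functions directly. A sketch in Python using a decorator that records per-endpoint call counts, failures, and cumulative latency; the `stats` structure and metric names are assumptions for illustration:

```python
import time
from collections import defaultdict
from functools import wraps

# Per-endpoint application metrics: call counts, failures, cumulative latency.
stats = defaultdict(lambda: {"calls": 0, "failures": 0, "total_seconds": 0.0})

def timed(name):
    """Decorator that records latency and failures for a service function."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                stats[name]["failures"] += 1
                raise
            finally:
                stats[name]["calls"] += 1
                stats[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return deco

@timed("lookup")
def lookup(key):
    return key.upper()

lookup("abc")
print(stats["lookup"]["calls"])  # → 1
```

Metrics libraries shipped with monitoring platforms provide equivalent decorators or middleware, so you rarely need to hand-roll this; the sketch only shows the mechanism.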
Network and connectivity metrics
Network and connectivity metrics inform you about the status of both internal and external networks. These measures can help you evaluate the availability of your applications to users and the accessibility of internal services to each other.
Standard network and connectivity metrics include bandwidth use, latency, packet loss, error rates, and concurrent sessions.
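Network metrics are typically derived by polling interface counters and computing deltas between snapshots. A small illustrative sketch; the counter field names are assumptions, and a real agent would read sources such as /proc/net/dev or SNMP:

```python
def delta_rates(prev: dict, curr: dict, interval_s: float) -> dict:
    """Derive network rates from two snapshots of interface counters."""
    sent = curr["packets_sent"] - prev["packets_sent"]
    dropped = curr["packets_dropped"] - prev["packets_dropped"]
    return {
        # Bytes delta converted to bits per second over the polling interval.
        "throughput_bps": 8 * (curr["bytes_sent"] - prev["bytes_sent"]) / interval_s,
        "packet_loss_pct": 100 * dropped / sent if sent else 0.0,
    }

prev = {"bytes_sent": 1_000, "packets_sent": 100, "packets_dropped": 0}
curr = {"bytes_sent": 126_000, "packets_sent": 1_100, "packets_dropped": 10}
print(delta_rates(prev, curr, interval_s=10))
# → {'throughput_bps': 100000.0, 'packet_loss_pct': 1.0}
```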
Server pool metrics
Server pool metrics inform you about the health and performance of clustered or distributed resources. These measures can help you evaluate how effectively workloads are being distributed, how available your services are, and how resources can be scaled to function more efficiently.
Standard server pool metrics include the number of failed or degraded instances and overall resource use.
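Server pool metrics are usually aggregates over per-instance data. A minimal sketch; the instance schema here is an assumption for illustration:

```python
def pool_health(instances: list) -> dict:
    """Summarize a server pool from per-instance status and utilization."""
    failed = [i for i in instances if i["status"] == "failed"]
    healthy = [i for i in instances if i["status"] == "healthy"]
    return {
        "size": len(instances),
        "failed": len(failed),
        # Average utilization of healthy instances hints at scaling headroom.
        "avg_cpu_pct": sum(i["cpu_pct"] for i in healthy) / len(healthy)
                       if healthy else None,
    }

pool = [
    {"status": "healthy", "cpu_pct": 40},
    {"status": "healthy", "cpu_pct": 60},
    {"status": "failed", "cpu_pct": 0},
]
print(pool_health(pool))  # → {'size': 3, 'failed': 1, 'avg_cpu_pct': 50.0}
```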
External dependency metrics
External dependency metrics inform you about performance or availability issues related to external or third-party services. For example, if you are using external payment processing or authentication services.
Standard external dependency metrics include operational costs, run rate, success or failure rates, and service status or availability.
User experience metrics
All of the metrics described above focus on understanding the operations of the service components. These are considered white-box metrics, i.e., data based on knowledge of how a system is engineered from the inside. Because modern systems consist of multiple components, it is often challenging to understand the impact of a specific internal metric on the eventual users of the system. To resolve this challenge, it is recommended to add user experience metrics obtained in a black-box manner, i.e., using mechanisms that monitor end-users operating the system and measure the response time, stability, and other relevant parameters of the user experience. An alternative approach is continuous synthetic testing automation (automatic processes that emulate the flows of regular users) and taking metrics from it.
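The synthetic-testing approach can be sketched as a runner that executes a scripted user flow and records black-box measurements. The `login_flow` below is a hypothetical stand-in; a real check would drive a browser or call the public API exactly as a user would:

```python
import time

def synthetic_check(flow, name: str) -> dict:
    """Run one synthetic user flow (a callable emulating a real user)
    and record black-box UX metrics: availability and response time."""
    start = time.perf_counter()
    try:
        flow()
        ok = True
    except Exception:
        ok = False
    return {"check": name, "available": ok,
            "response_seconds": time.perf_counter() - start}

# Hypothetical flow: in practice this would log in and load a dashboard.
def login_flow():
    time.sleep(0.01)  # stand-in for real network/UI work

result = synthetic_check(login_flow, "login")
print(result["available"])  # → True
```

Running such checks continuously from several locations gives a user's-eye view of availability and latency, independent of any internal instrumentation.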
Factors That Affect What You Choose to Monitor
Start by understanding your system, both from a white-box perspective (considering its internal operations) and from a black-box perspective (how an end-user views it). These two perspectives will dictate the metrics that are most critical to measure and to alert on.
Most DevOps monitoring tools can provide many times more information than a team can actually manage, so it’s important to make sure you are extracting the data that is most important for your goals.
When considering various approaches, consider the notion of observability. Observability, a notion taken from control theory, is the extent to which a system's state can be understood from its external outputs. While, in theory, everyone would love to have 100% observability all the time, various constraints make this impractical: for example, not all of your components can be instrumented to a similar extent, and storing and processing high-cardinality data is expensive.
When determining your measures, keep the following factors in mind:
- Available resources—this includes human resources, budget, and infrastructure. You need to have enough bandwidth, computational power, and storage to collect and process measures. You also need to have the time and staff to review metrics and apply measures meaningfully.
- Component function—the measures you collect should be tailored to your component. Focus on measures related to mission-critical functionality and performance.
- Deployment environment—some environments are more critical than others and so should get more attention. For example, production environments vs ephemeral development environments. This doesn’t mean that you shouldn’t monitor all environments, just that you should be selective in what measures you’re using and when you alert for issues.
- Metric utility—it may seem obvious, but you should not waste effort collecting metrics that serve no purpose. Unless you are doing a deep analysis of your processes and configurations for optimization, many metrics are not useful. When selecting measures, focus on those that support your immediate and pressing goals.
Monitoring and Observability Platform vs. Best-of-Breed Stack
Software services observability refers to the ability to understand the internal state of components by checking their outputs, rather than ‘intruding’ into their production environments. Observability can be achieved via comprehensive monitoring, logging, tracing, and alerting platforms, such as DataDog. Or it can be achieved by building a unique platform suitable for the needs of a specific organization.
A comprehensive approach for operationalizing software services should include the following components:
- Infrastructure monitoring
- Network performance monitoring (NPM)
- Application performance monitoring (APM)
- Real user monitoring (RUM)
- Security monitoring
- Distributed tracing
- Log management
Platforms like DataDog offer all of the above in a single platform. Some of the components are included in subscription bundles, while others require separate subscription add-ons.
Many organizations may also want to add more operational components, such as the below. While some of these are included in platforms, many require additional solutions or integrations:
- Component ownership management
- On-call rotation management
- Service level objective (SLO) dashboards
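An SLO dashboard is ultimately built on an error-budget computation. A minimal sketch for an availability SLO; the 99.9% target and request counts are illustrative:

```python
def error_budget(slo_target: float, good_events: int, total_events: int) -> dict:
    """Compute remaining error budget for an availability SLO."""
    allowed_bad = (1 - slo_target) * total_events  # failures the SLO permits
    actual_bad = total_events - good_events
    return {
        "availability_pct": round(100 * good_events / total_events, 3),
        "budget_remaining_pct": round(100 * (1 - actual_bad / allowed_bad), 1),
    }

# 99.9% SLO over one million requests: 1,000 failures allowed.
print(error_budget(0.999, good_events=999_400, total_events=1_000_000))
# → {'availability_pct': 99.94, 'budget_remaining_pct': 40.0}
```

When the remaining budget approaches zero, teams typically slow down releases; when plenty remains, they can take more risk. That policy link is what makes the dashboard actionable.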
Working with a single monitoring platform vs. building a best-of-breed solution can have advantages and disadvantages. Based on our conversations with various organizations, these are some of the most common considerations:
Pros of the Platform Approach
You should consider using a platform when your priority is:
- Simplified licensing, pricing, certification, and procurement—platforms can provide a significant advantage when procurement processes are required for the adoption of new solutions. This is especially true when underlying infrastructure services must be certified.
- Built-in compatibility between components offers better alerting and observability—compatibility makes life easier for engineers managing and operating the platform. However, keep in mind that this benefit varies according to the maturity of technologies in place. Compatibility may be more relevant in future development than at present.
- Possibly lower learning curve—pre-built solutions may be easier for teams to learn and operate efficiently. This is because proprietary solutions often include built-in materials for training and tend to use fewer interfaces for operation.
Pros of the Best-of-Breed Approach
You should consider using a best-of-breed system when your priority is:
- Avoiding vendor lock-in—by combining components from multiple vendors and solutions you can avoid reliance on a single product or provider. This solution enables you to build in some protection from vendor price changes, internal conflicts, or changes to products that don’t match your needs. An alternative that some companies consider is having two pre-built solutions. One solution is kept in a fully operational state and another back-up one is kept in a semi-operational state. This backup solution can be promoted to operational if things change with the main vendor.
- Possible cost reduction—some companies choose to use cheaper monitoring/log management/alerting systems (including self-hosted open source solutions) in non-critical environments. However, this creates the need to consolidate processes across multiple solutions. Despite this disadvantage, it is still a very common practice.
- Superiority and specialization of “niche” solutions—best-of-breed systems enable you to select the optimal solution for specific monitoring aspects. Although platform vendors strive to make platform components functionally on-par with specialized solutions, this isn’t always possible. However, a platform's open integration interfaces can allow stable and easy incorporation of specialized solutions, reducing the need for a full best-of-breed stack.
What Happens When an Alert Comes In? Automating Your Response
Continuously monitoring the operation of software services is a fundamental part of both DevOps culture and of software engineering in general. Choosing the right tools for the job, and keeping the monitoring pipeline up-to-date with the software and infrastructure landscape will allow you to provide stable and reliable software services.
While monitoring enables the operations aspects of the DevOps culture, it is only a means, not a goal. The goal is to be able to provide uninterrupted service to the users. Operations activities that ensure this goal is reached, based on monitoring data, are as critical as having the monitoring in place. The efficiency of such operations, and the ability to react to events at scale before they affect the users of the service, is a business-critical need.
StackPulse is a platform that allows you to automate, analyze, and visualize operational aspects of Site Reliability Engineering. When an alert or a notification is sent, an automatic playbook is triggered by StackPulse, ensuring that a predictable set of operational steps is performed.
The StackPulse platform can help you perform the following actions automatically:
- Verifying the extent of the problem and its impact on components of the system – is it a mission-critical problem? How high would we prioritize handling it? Is this a false-positive or just a reflection of another issue that is already being handled?
- Checking potential causes for the problem, either disqualifying them or identifying the direction for the root cause.
- Updating all relevant persons and systems with the latest state of the potential incident.
- Ensuring all investigation and remediation steps are audited and documented in order to enable efficient post-incident analysis.
- Automatically applying remediations and monitoring their effect on the system to achieve self-healing.
StackPulse ensures that engineering organizations get a high return on investment on monitoring infrastructure, by making every alert and notification count.
Get early access to the StackPulse platform and try it for yourself!