Getting Started With Prometheus Alertmanager: Concepts and Best Practices

Alertmanager is a Prometheus component that enables you to configure and manage alerts sent by the Prometheus server and to route them to notification, paging, and automation systems. However, before you can start using Alertmanager, you need to gain a deeper understanding of its basic concepts.

Learn more about Prometheus and monitoring best practices in our article about DevOps monitoring.

What Is Prometheus Alertmanager?

Alertmanager is a Prometheus component that groups and routes alerts from Prometheus to other systems in your DevOps toolset. It works in collaboration with the alerting rules you create on your Prometheus servers. 

With Prometheus Alertmanager, you can handle alerts sent by your client applications. You can also deduplicate, group, silence, and route alerts through notification platforms like PagerDuty or OpsGenie.

Alertmanager Architecture

Figure: Prometheus Alertmanager architecture

Prometheus Alertmanager Concepts

When using Prometheus Alertmanager, there are a few concepts to be aware of to ensure you manage notifications smoothly.

Grouping

Grouping enables you to categorize similar events into a single notification. This helps prevent your teams from being overwhelmed by alerts during times of crisis, such as system outages or failures. You can configure Prometheus group categories, timing, and targets via the routing tree in your configuration file.
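
As an illustration, a route block in your Alertmanager configuration might look like the following sketch. The receiver name and label names here are placeholders, not values from the guide above:

route:
  # "default-receiver" is a placeholder; it must match a receiver you define
  receiver: 'default-receiver'
  # Bundle alerts that share these labels into a single notification
  group_by: ['alertname', 'cluster']
  # How long to wait to collect other alerts for the same group before sending
  group_wait: 30s
  # How long to wait before notifying about new alerts added to an existing group
  group_interval: 5m
  # How long to wait before repeating a notification that was already sent
  repeat_interval: 4h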

Inhibition

Inhibition allows you to suppress notifications based on current conditions. For example, if one server goes down, the feature can suppress alerts related to workloads on that server.

Similar to grouping, this helps prevent being overwhelmed with alerts and reduces redundancy in your notifications. You can configure inhibitions in your Alertmanager configuration file. 
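
For example, a minimal inhibit_rules block might look like the sketch below. The severity values and the instance label are assumptions based on typical alerting rules, not requirements:

inhibit_rules:
  # Suppress warning-level alerts while a related critical alert is firing
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    # Only inhibit when both alerts refer to the same instance
    equal: ['instance']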

Silences

Silencing rules allow you to mute alerts for a given period. When an alert is triggered, it is checked against active silencing rules and either allowed through or silenced as appropriate. You can configure silences through your Alertmanager web interface. This prevents unnecessary alerts during scheduled maintenance windows.
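
Besides the web interface, you can also create silences with amtool, the command-line client bundled with Alertmanager releases. A rough example (the matcher values and duration are placeholders) might look like this:

# Silence the InstanceDown alert for one instance for two hours
./amtool silence add alertname=InstanceDown instance=web-1:9100 \
  --alertmanager.url=http://localhost:9093 \
  --comment="Scheduled maintenance window" \
  --duration=2h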

How to Set Up Alertmanager in Prometheus

The following guide can help you get started with Prometheus Alertmanager. It walks you through how to set up your alerting rules and how to configure Alertmanager for operations. 

Create alerting rules in Prometheus

The first step to setting up your rules is to locate your dedicated Prometheus folder. In this folder, there should be subfolders for server, node_exporter, github_exporter, and prom_middleware.

Next, you need to create a rules.yml file from the command line or in the code editor of your choice. In this file, you can specify your alert conditions. This file should be saved to your server subfolder. To create it from the command line, you can use the following command:

cd Prometheus/server
touch rules.yml

With your file created, you can begin defining your rules. This requires understanding which Prometheus metrics are available to alert on and what the values for those metrics mean. For example, if you want to create an alert notifying you when instances go down, you can use the up metric. This metric is binary: running instances are marked as 1 and down instances as 0.

Adding a rule to your file to alert when instances are down should look something like the following:

groups:
  - name: AllInstances
    rules:
      - alert: InstanceDown
        # Alert condition
        expr: up == 0
        for: 1m
        # Alert information
        annotations:
          title: 'Instance {{ $labels.instance }} down'
          description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
        # Alert labels
        labels:
          severity: 'critical'

You should add any additional rules you need and save your file. Once saved, you need to link your rules file to your prometheus.yml and configure your alerts. This involves specifying scrape frequencies, setting default ports, and setting your scrape targets. This configuration should result in a YAML file that looks something like the following:

global:
  # How often Prometheus scrapes targets by default
  scrape_interval:     10s
  # How often Prometheus evaluates alerting and recording rules
  evaluation_interval: 10s

rule_files:
  - rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Default Alertmanager port
            - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets:
          # Prometheus scraping its own metrics endpoint
          - localhost:9090

Set up Alertmanager

Once your rules are created and configured, you're ready to set up Alertmanager. Start by creating an alert_manager subfolder in your Prometheus folder. Next, download and extract Alertmanager into this subfolder. The extracted archive includes an alertmanager.yml configuration file.
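
The alertmanager.yml file defines how alerts are routed to receivers. A minimal sketch, assuming a Slack receiver (the webhook URL and channel below are placeholders for your own notification settings), might look like this:

route:
  receiver: 'team-notifications'
  group_by: ['alertname']
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: 'team-notifications'
    slack_configs:
      # Placeholder webhook URL; replace with your own Slack incoming webhook
      - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
        channel: '#alerts'
        send_resolved: true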

To initialize Alertmanager, run the following command:

./alertmanager --config.file=alertmanager.yml

By default, the Alertmanager web interface listens on port 9093 on the host where it is running. Connect to this port in your browser and you should see the Alertmanager dashboard. From there, you can monitor and manage alerts.
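
You can also verify that Alertmanager is up from the command line by querying its built-in health endpoint, for example:

# Returns "OK" when Alertmanager is healthy
curl http://localhost:9093/-/healthy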

Prometheus Alerting Best Practices

With Prometheus Alertmanager in place, there are several best practices you can apply to ensure that your alerts are relevant and useful.

What to alert on

Try to keep the number of alerts you create as low as possible. You can achieve this by creating only essential alerts, grouping alerts to reduce notifications, or silencing alerts that do not require action. In particular, prioritize your alerts to ensure that you see the most relevant ones first, such as those indicating top-level outages or issues affecting the end-user experience.

Ensuring that your monitoring infrastructure is working

In addition to alerting on your systems, you also need to monitor your Prometheus resources. This includes your Alertmanager server, Pushgateways, and any notification platforms. Additionally, you may want to use external monitoring for your Prometheus configurations, so that you are alerted if the internal monitoring systems themselves fail.

In this monitoring layer, prioritize alerts for symptoms rather than causes, and periodically test your systems. For example, you can perform black-box testing to ensure that alert events can travel the full path from the Pushgateway to Alertmanager. Learn more in our article about Prometheus monitoring.

Monitoring SaaS environments

When setting alerts for components of software as a service (SaaS) systems, pay closer attention to error rate and latency alerts higher up in your stack (closer to the actual consumers of the service). Alerts on the same issues in lower-level components do not necessarily indicate an impact on end users and should be handled differently. Likewise, if lower-level errors do not affect users, you may be able to monitor them with lower urgency.

One exception to these guidelines is when errors reflect security or business-goal concerns. In these cases, you want to alert your teams actively. You should also alert when errors are likely to be drowned out otherwise, for example, errors in low-traffic components that would be lost among the noise of high-traffic ones.

Batch jobs

When creating alerts for batch jobs, set your alerts over a reasonable time period. As a general rule, you can base this period on the time it takes to perform two full runs of the job. This helps ensure that you aren't alerted every time a batch run completes a little late.
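
As a rough sketch, a rule for a batch job that reports through the Pushgateway could alert when no push has been seen for two run intervals. The job label (etl_hourly) and the two-hour threshold are assumptions for illustration:

groups:
  - name: BatchJobs
    rules:
      - alert: BatchJobStale
        # push_time_seconds is recorded by the Pushgateway on each push;
        # 7200s covers two full runs of an hourly job
        expr: time() - push_time_seconds{job="etl_hourly"} > 7200
        for: 5m
        labels:
          severity: 'warning'
        annotations:
          title: 'Batch job {{ $labels.job }} has not reported for over two hours'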

Operationalizing Prometheus With StackPulse

Monitoring production environments of modern software services is essential for resilient operations. However, alerting by itself isn’t sufficient to keep services up and running.

StackPulse is an orchestration and automation platform for site reliability engineering. It natively integrates with Prometheus Alertmanager and other alerting systems, automatically triaging, enriching, and acting upon alerts in real time. This reduces alert fatigue, shortens MTTD by identifying root causes automatically, and streamlines incident remediation by automating manual tasks.

Get early access to StackPulse to learn how to supercharge your alerting systems, reduce toil, and improve reliability.