Opsgenie: Alerts, Fundamentals, and Best Practices
On-call management platforms like Opsgenie enable you to define and manage the processes your teams follow when responding to alerts or incidents. You can integrate the monitoring systems in your existing DevOps toolset with Opsgenie and manage the follow-up to the alerts they raise. Opsgenie capabilities include actionable alerting, on-call schedule management, and advanced reporting and analytics.
In this article, you will learn:
- Opsgenie basics
- Opsgenie capabilities
- Opsgenie initial setup
- Opsgenie on-call best practices
What Is Opsgenie?
Opsgenie is an on-call management platform you can use to receive alerts from your monitoring systems and manage your responses. It enables you to centralize on-call management from across your monitoring systems and categorize alerts by priority, customer impact, or timing.
With Opsgenie, you can create on-call schedules defining who should be contacted and through what channel. Opsgenie supports push notifications, SMS, voice calls, and email. If the scheduled person does not respond, you can set the platform to automatically escalate the alert until someone responds.
Learn more in our guide to DevOps monitoring, which walks you through metrics, logs, traces, notifications, and alerts.
Opsgenie Capabilities
When you adopt Opsgenie, you gain several capabilities, including the following:
- Consolidated alerting—you can use Opsgenie to group alerts, notify via multiple channels, filter excess alerts, and provide alert context. You can also integrate the platform with your existing ticketing, monitoring, and communication tools for streamlined workflows.
- On-call schedule management and escalations—you can define your schedule and escalation rules in a single interface. This same interface enables you to monitor your alerts and ensure that your team is aware of and accountable for on-call responsibilities.
- Advanced reporting and analytics—the platform automatically logs all alerting and event-related data for easy auditing, analysis, and reporting. You can use this data to track down the source of alerts, identify flaws in your workflows, or measure team performance.
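To illustrate the kind of analysis that logged alert data makes possible, here is a minimal Python sketch that computes mean time to acknowledge (MTTA) per team. The record layout (`team`, `created_at`, `acknowledged_at`) and the sample values are hypothetical, chosen only for illustration; they are not Opsgenie's actual report schema.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical alert log records; field names and values are assumptions.
ALERT_LOG = [
    {"team": "payments", "created_at": "2024-03-01T10:00:00", "acknowledged_at": "2024-03-01T10:04:00"},
    {"team": "payments", "created_at": "2024-03-01T12:00:00", "acknowledged_at": "2024-03-01T12:10:00"},
    {"team": "platform", "created_at": "2024-03-02T08:30:00", "acknowledged_at": "2024-03-02T08:32:00"},
    {"team": "platform", "created_at": "2024-03-02T09:00:00", "acknowledged_at": None},  # never acknowledged
]


def mean_time_to_acknowledge(records):
    """Return {team: mean seconds between alert creation and acknowledgement}."""
    durations = defaultdict(list)
    for record in records:
        if record["acknowledged_at"] is None:
            continue  # unacknowledged alerts have no duration to measure
        created = datetime.fromisoformat(record["created_at"])
        acked = datetime.fromisoformat(record["acknowledged_at"])
        durations[record["team"]].append((acked - created).total_seconds())
    return {team: mean(values) for team, values in durations.items()}


mtta = mean_time_to_acknowledge(ALERT_LOG)
for team, seconds in sorted(mtta.items()):
    print(f"{team}: {seconds / 60:.1f} minutes")
```

A team whose MTTA keeps rising may be a sign that notification rules or schedules need adjusting.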
Learn more in our guide to DevOps automation, which explains how to streamline your pipeline while avoiding over-automation.
Opsgenie Initial Setup
When first implementing Opsgenie, there are several components you need to set up. You can learn the basics required to get you started below.
Set up your profile
When setting up your profile, you need to define contact methods and your time zone. Once this is set, you can move to the Notifications tab to begin setting your Notification Rules. These notifications are managed on a user level, with rules outlined individually for each user. You can set multiple rules for each user to be executed in hierarchical order.
When setting your rules you can define notification channels according to alert content, time of day, or time since sending. This enables you to customize rules to ensure that events are handled with the right priority.
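The way ordered rules combine alert content, time of day, and time since sending can be sketched in Python. This is a simplified model, not Opsgenie's rule engine: the `NotificationRule` class, its fields, and the sample rules are all hypothetical. The priority labels P1 to P5 follow Opsgenie's priority naming.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class NotificationRule:
    channel: str          # e.g. "push", "sms", "voice", "email"
    priorities: frozenset  # alert priorities this rule applies to
    start: time           # rule is active from this time of day...
    end: time             # ...until this time of day
    delay_minutes: int    # minutes after the alert was sent before notifying


def in_window(now, start, end):
    """True if `now` is inside [start, end); supports windows that cross midnight."""
    if start == end:
        return True  # treat identical bounds as "all day"
    if start < end:
        return start <= now < end
    return now >= start or now < end


def channels_to_notify(rules, priority, now, minutes_since_sent):
    """Return the channels, in rule order, whose conditions all match."""
    return [
        rule.channel
        for rule in rules
        if priority in rule.priorities
        and in_window(now, rule.start, rule.end)
        and minutes_since_sent >= rule.delay_minutes
    ]


MIDNIGHT = time(0, 0)
RULES = [  # hypothetical rules for one user, evaluated top to bottom
    NotificationRule("push", frozenset({"P1", "P2", "P3"}), MIDNIGHT, MIDNIGHT, 0),
    NotificationRule("sms", frozenset({"P1", "P2"}), MIDNIGHT, MIDNIGHT, 5),
    NotificationRule("voice", frozenset({"P1"}), time(22, 0), time(7, 0), 10),
    NotificationRule("email", frozenset({"P4", "P5"}), time(9, 0), time(17, 0), 0),
]

print(channels_to_notify(RULES, "P1", time(23, 30), 12))
```

In this sketch, a P1 alert that is still unanswered after 12 minutes at 23:30 has reached the push, SMS, and voice rules, while a P4 alert outside office hours matches nothing.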
Set up your teams
Defining teams allows you to set escalation paths for your on-call staff.
Like notification rules, the order of escalation paths can be customized according to alert content or time of day. Teams can also help you ensure that the burden of responsibility for events is distributed as equally as possible. Each of your teams can have an admin who is responsible for managing the system for those users.
When setting up your teams, there are three components you need to account for:
- Routing rules—define which escalations should be triggered and when.
- Escalation policies—define which team member should be notified and how missed alert notifications should be handled.
- Schedules—define the order of on-call staffing at any given time. They also determine how on-call shifts are distributed throughout your team. Schedules are automatically created when teams are created but can be customized as needed.
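The interplay between a schedule and an escalation policy can be sketched as follows. This is a minimal model under stated assumptions, not how Opsgenie implements it: the weekly rotation, the member names, the step delays, and the `team_admin` fallback are all hypothetical.

```python
from datetime import date

ROTATION = ["alice", "bob", "carol"]  # hypothetical team members
ROTATION_START = date(2024, 1, 1)     # a Monday; the first shift belongs to ROTATION[0]

# Each step: (minutes the alert has gone unacknowledged, role to notify)
ESCALATION_STEPS = [
    (0, "on_call"),
    (10, "next_on_call"),
    (20, "team_admin"),
]


def on_call(day, offset=0):
    """Return who is on call on `day` in a weekly rotation, shifted by `offset` shifts."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ROTATION[(weeks_elapsed + offset) % len(ROTATION)]


def notified_so_far(day, minutes_unacknowledged, team_admin="dana"):
    """Return everyone the escalation policy has notified by now, in order."""
    targets = {
        "on_call": on_call(day),
        "next_on_call": on_call(day, offset=1),
        "team_admin": team_admin,
    }
    return [
        targets[role]
        for delay, role in ESCALATION_STEPS
        if minutes_unacknowledged >= delay
    ]


print(notified_so_far(date(2024, 1, 10), minutes_unacknowledged=15))
```

Keeping the schedule (who is on call) separate from the escalation steps (what happens when they do not respond) mirrors the split between the components listed above, and lets you change one without touching the other.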
Set up Opsgenie integrations
Once your teams are defined, you can begin integrating any existing tools you want to use, including New Relic, Datadog, and Prometheus on the monitoring side, and Slack or Jira on the alert distribution side.
You can do this through built-in integrations or through generic API integrations. Once a tool is integrated, default rules are applied, and you can customize them as needed.
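For a generic API integration, a custom tool typically creates alerts by posting JSON to the Opsgenie Alert API. The sketch below builds such a request with Python's standard library but does not send it. The endpoint, the `GenieKey` authorization scheme, and the `message`, `priority`, `tags`, and `details` fields reflect the Alert API as we understand it; verify them against the current API reference (EU-hosted accounts use a different host). The helper name, the placeholder key, and the sample alert are hypothetical.

```python
import json
import urllib.request

OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"


def build_create_alert_request(api_key, message, priority="P3", tags=None, details=None):
    """Build (but do not send) an HTTP request that would create an alert."""
    payload = {
        "message": message,      # short alert title
        "priority": priority,    # P1 (critical) through P5 (informational)
        "tags": tags or [],
        "details": details or {},
    }
    return urllib.request.Request(
        OPSGENIE_ALERTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"GenieKey {api_key}",
        },
        method="POST",
    )


request = build_create_alert_request(
    api_key="YOUR_API_KEY",  # placeholder; use the key from your API integration
    message="Disk usage above 90% on db-01",
    priority="P2",
    tags=["disk", "database"],
    details={"host": "db-01"},
)
# To actually send the alert: urllib.request.urlopen(request)
```

Building the request separately from sending it makes the payload easy to inspect and test before the integration goes live.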
After your various components are set up, most of the alert management process is automated. Alerts are sent through your defined channels to the appropriate team member and are escalated as needed. Each alert is also tracked until it is acknowledged or closed. From the interface, you can monitor alert progress and make any manual changes necessary.
Opsgenie On-Call Best Practices
When implementing Opsgenie, the following best practices can help you ensure that your notifications are properly configured and provide relevant information.
Set Up Alert Notification Rules
Notification rules enable you to classify alerts and send notifications according to priority. You should also leverage these rules to send different kinds of notifications appropriate to the content. For example, high priority alerts may trigger a combination of voice and mobile notifications. Meanwhile, low priority alerts may only require one notification method, such as email or SMS.
Access to the Right Tools and Correct Permissions
Whoever is receiving your notifications needs to be able to assess and respond to the source of the alert. This typically means recipients need admin or superuser access to your systems and need to understand how various management processes work. This enables them to diagnose, troubleshoot, and correct the issue without escalating to another staff member.
Train Engineers on the Relevant Diagnostic Tools
In addition to having the right permissions, your responders need to understand which information sources alerts may stem from and what those tools are designed for. These tools can vary widely, but some commonly used options include:
- New Relic Insights—enables you to query events and retrieve metrics using the New Relic Query Language (NRQL). You can use it for segmentation, drill down, databases, customer experience, performance analysis, and Kubernetes monitoring.
- Amazon CloudWatch—enables you to monitor your AWS services and provides insight into operational health, application performance, and resource utilization. You can use CloudWatch through the AWS Console or ingest data with another solution.
- Graylog—enables you to collect and manage your logs. You can use it to analyze data from across your sources and visualize reports on an intuitive UI.
Streamline Escalation With a Central Notification Template
If you are using Opsgenie to notify multiple users of single alerts, such as through on-call schedules, you should streamline your escalation process. The simplest way to do this is with the Central Notification Template (CNT), which lets you notify users according to their roles. However, keep in mind that you can only apply one CNT to each role, and users cannot modify their personal notification rules.
What Happens When Opsgenie Alerts Come In? Automating Your Response with StackPulse
Continuously monitoring the operation of software services is a fundamental part of both DevOps culture and of software engineering in general. Choosing the right monitoring tools for the job, and keeping the monitoring pipeline up-to-date with the software and infrastructure landscape, will build a foundation allowing you to provide stable and reliable software services.
Monitoring your systems with tools like Opsgenie supports the operations side of DevOps culture, but it is only a means, not a goal. The goal is to provide uninterrupted service to users. The operations activities that use monitoring data to reach this goal are as critical as the monitoring itself. The ability to react to events at scale, before they affect the users of the service, is a critical business need.
StackPulse is a platform that enables you to automate, analyze and visualize operational aspects of Site Reliability Engineering. When an alert or a notification is sent, an automatic playbook is triggered by StackPulse, ensuring that a predictable set of operational steps is performed.
The StackPulse platform can help you perform the following actions automatically:
- Verifying the extent of the problem and its impact on components of the system. Is it a mission-critical problem? How high a priority should it be? Is it a false positive, or a symptom of another issue that is already being handled?
- Checking potential causes for the problem, and either disqualifying them or identifying the direction for the root cause.
- Updating all relevant persons and systems with the latest state of the potential incident.
- Ensuring all investigation and remediation steps are audited and documented in order to enable efficient post-incident analysis.
- Automatically applying remediations and monitoring their effect on the system to achieve self-healing.
StackPulse ensures that engineering organizations get a high return on investment on monitoring infrastructure, by making every alert and notification count.