VictorOps Basics and a 5-Step Quick Start Guide

VictorOps is a tool you can use for involving engineering and operations leaders in handling production incidents. When something goes wrong with IT systems, VictorOps notifies the right people and ensures they are working to resolve the problem. 

VictorOps (or a similar system) is a common part of the DevOps toolset, alongside other on-call scheduling tools like PagerDuty or OpsGenie.

Key features of VictorOps include:

  • Scheduling and escalations tools enabling streamlined notifications
  • Customizable alert rules to reduce noise
  • Centralized view for alert data
  • Mobile notifications with acknowledge, resolve and reroute, via native iOS and Android apps

In this article, you will learn:

  • What are VictorOps Teams?
  • What are VictorOps Rotations and Shifts?
  • VictorOps Alert Behavior
    • Routing Keys in VictorOps
    • Routing Rules in VictorOps
    • Rerouting an Incident
  • VictorOps Notifications
    • Notification Aggregation
  • Getting Started with VictorOps in 5 Steps
  • What Happens after an Alert? Automating Your Response with StackPulse

What are VictorOps Teams?

A team is a group of VictorOps users who share on-call schedules and escalation policies.

What are VictorOps Rotations and Shifts?

After creating a team in VictorOps, admins can create an on-call schedule. A schedule has three parts: rotations, shifts, and escalation policies. Rotations are recurring schedules made up of one or more shifts. There are three types of shifts: partial day, 24/7, and multi-day.
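To make the idea of a rotation concrete, here is a minimal Python sketch of a 24/7 rotation that hands off between members at a fixed interval. The Rotation class and its fields are hypothetical names chosen for illustration; they are not VictorOps' own data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Rotation:
    """Hypothetical model of a recurring shift with a fixed handoff interval."""
    members: list[str]
    start: datetime
    shift_length: timedelta

    def on_call(self, at: datetime) -> str:
        # Count the whole shifts elapsed since the rotation started,
        # then cycle through the member list.
        elapsed_shifts = (at - self.start) // self.shift_length
        return self.members[elapsed_shifts % len(self.members)]


rotation = Rotation(
    members=["alice", "bob", "carol"],
    start=datetime(2024, 1, 1, 9, 0),
    shift_length=timedelta(days=7),
)
print(rotation.on_call(datetime(2024, 1, 10, 12, 0)))  # second week of the rotation
```

A partial-day or multi-day shift would need start and end times per shift; the sketch leaves those out to keep the handoff logic visible.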

VictorOps Alert Behavior

VictorOps alerts help you manage incidents and ensure the right people are notified about them, acknowledge them, and resolve them.

Routing Keys in VictorOps

Alert routing allows admins to assign specific alerts to specific groups. Routing keys let you direct specific alerts to specific team members, who can then resolve those cases without notifying the entire team.

Routing Rules in VictorOps

When VictorOps generates an alert, the routing rules specify who should receive the alert. If the incident has a specific escalation policy, the policy determines who is initially notified, and to whom the incident should be escalated if it is not acknowledged.
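The relationship between routing keys and escalation policies can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the ROUTING table, the step delays, the target names, and the fallback to a "default" key are hypothetical, not VictorOps configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int          # delay since the incident opened
    targets: tuple[str, ...]    # users or rotations paged at this step


# Hypothetical routing table: routing key -> ordered escalation steps.
ROUTING = {
    "database": (
        EscalationStep(0, ("dba-on-call",)),
        EscalationStep(15, ("dba-lead",)),
    ),
    "default": (EscalationStep(0, ("ops-on-call",)),),
}


def notified_so_far(routing_key: str, minutes_unacknowledged: int) -> list[str]:
    """Return everyone paged for an incident still unacknowledged after N minutes."""
    # Unknown keys fall back to the default policy (an assumption of this sketch).
    steps = ROUTING.get(routing_key, ROUTING["default"])
    return [
        target
        for step in steps
        if step.after_minutes <= minutes_unacknowledged
        for target in step.targets
    ]


print(notified_so_far("database", 20))
```

Once the incident is acknowledged, escalation stops; the sketch only models the unacknowledged case.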

Rerouting an Incident

Sometimes, a user receiving a notification may need to reroute the incident to other teams or specific individuals. VictorOps can route incidents to individuals, groups, or escalation policies.

  • When an incident is delivered directly to a user, the user is notified according to their individual paging policy until they acknowledge the incident.
  • When an incident is routed to an escalation policy, or multiple policies, the appropriate individuals are notified, and the incident is escalated, according to each policy. 
  • Make sure the team members have configured escalation policies correctly, ensuring that important incidents don’t end up in a collective email box, but are directly paged to the relevant staff.

VictorOps Notifications

VictorOps notifications are the mechanism by which staff get informed about important incidents. Incidents are typically created as a result of alerts generated by monitoring tools, which are integrated with VictorOps.

Related content: read our guide to DevOps monitoring

Push Notifications

Push notifications are sent through the native mobile apps. The following items can be pushed to the app, and users can view each notification, forward it to others, or report it:

  • Paging
  • On-call changes
  • Chats
  • Control calls

SMS

SMS notifications have a maximum of 160 characters. They show an incident ID, incident display name, and if two-way SMS is supported, response codes. The user can text one of the reply codes to provide updates about the incident.
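As a rough illustration of the 160-character limit, the Python sketch below fits an incident ID, a display name, and an optional reply code into a single message, shortening the display name when needed. The message layout and the format_sms helper are hypothetical; VictorOps' real SMS format may differ.

```python
from typing import Optional

SMS_LIMIT = 160  # maximum SMS length stated above


def format_sms(incident_id: int, display_name: str, ack_code: Optional[str] = None) -> str:
    """Build a notification that never exceeds SMS_LIMIT characters (hypothetical layout)."""
    prefix = f"#{incident_id} "
    # Reply codes are only included when two-way SMS is supported.
    suffix = f" Reply {ack_code} to ack" if ack_code else ""
    room = SMS_LIMIT - len(prefix) - len(suffix)
    if room < 3:
        raise ValueError("no room left for the incident display name")
    if len(display_name) > room:
        # Truncate the name and mark the cut with an ellipsis.
        display_name = display_name[: room - 3] + "..."
    return prefix + display_name + suffix


print(format_sms(42, "CPU usage above 95% on web-01", ack_code="5678"))
```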

Email

VictorOps can also use email for incident notifications, to report gaps in schedules, or to report if VictorOps itself is in maintenance mode.

Phone

Phone notifications read out the incident display name, and allow the user to acknowledge or resolve the alert by entering a response code.

Notification Aggregation

When many alerts are sent at the same time, it is important to send them together, to avoid overwhelming on-call teams. To limit the number of notifications, VictorOps automatically aggregates notifications when it detects an “alert storm”. It is also possible to mute all alerts or specific alerts.

Aggregation begins after three different incidents are opened within one minute. From that point onwards, notifications are sent only once per minute, and show the total number of incidents of the same type that have been issued.
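The aggregation rule can be sketched as a small Python function. This is one possible reading of the behavior described above, not VictorOps' actual algorithm: the function name, the use of seconds as timestamps, and the choice to ignore incident type are all assumptions of the sketch.

```python
from collections import deque

WINDOW = 60     # seconds: the one-minute window described above
THRESHOLD = 3   # incidents within the window before aggregation starts


def plan_notifications(opened_at: list[float]) -> list[tuple[float, int]]:
    """Return (send_time, incident_count) pairs for sorted incident open times."""
    notifications = []
    recent = deque()   # open times that fall inside the current window
    pending = 0        # incidents not yet covered by a notification
    last_sent = None
    for t in opened_at:
        recent.append(t)
        while t - recent[0] >= WINDOW:
            recent.popleft()
        pending += 1
        in_storm = len(recent) > THRESHOLD
        throttled = last_sent is not None and t - last_sent < WINDOW
        if in_storm and throttled:
            continue   # hold this incident for the next aggregated notification
        notifications.append((t, pending))
        pending = 0
        last_sent = t
    return notifications


print(plan_notifications([0, 10, 20, 30, 50, 70, 81]))
```

The first three incidents each produce their own notification; the next four are held and reported together once a minute has passed since the last one. A real system would flush held incidents on a timer, whereas this sketch only sends when a new incident arrives.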

Getting Started with VictorOps in 5 Steps

Here is a brief process you can use to start implementing VictorOps for your organization.

1. Add users

The first and most important step in configuring VictorOps is adding users. To add a new user:

  • Select User > User invitation and add the user’s email
  • Add users using the API by selecting Integration > API
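For the API route, the Python sketch below builds (but does not send) a request to create a user. The endpoint path, the X-VO-Api-Id and X-VO-Api-Key header names, and the body fields reflect our understanding of the VictorOps public API and should be treated as assumptions; check them against the API reference available under Integration > API before relying on them. The credentials and user details are placeholders.

```python
import json
import urllib.request

# Assumed base URL of the public API; verify against the API reference.
API_BASE = "https://api.victorops.com/api-public/v1"


def build_add_user_request(api_id, api_key, username, email, first_name, last_name):
    """Build a POST request that would create a user (field names are assumptions)."""
    body = {
        "firstName": first_name,
        "lastName": last_name,
        "username": username,
        "email": email,
    }
    return urllib.request.Request(
        f"{API_BASE}/user",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-VO-Api-Id": api_id,
            "X-VO-Api-Key": api_key,
        },
        method="POST",
    )


request = build_add_user_request(
    "my-api-id", "my-api-key", "jdoe", "jdoe@example.com", "Jane", "Doe"
)
# To actually send it: urllib.request.urlopen(request)
```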

2. Create Teams

A team consists of a list of users, and manages on-call shifts and escalations. Create a team by selecting Teams > Add Team, and adding a team name.

3. Create Rotations, Escalation Policies, and Routing Keys

We defined these concepts above. Create at least one Rotation, Escalation Policy, and Routing Key.

4. Integrations

Select Integrations to connect VictorOps to your existing infrastructure. Each integration can feed alerts into VictorOps, enabling it to create incidents and notify staff. See a full list of VictorOps integrations.
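If your monitoring tool has no ready-made integration, a generic REST endpoint is one way to feed alerts in. The Python sketch below builds such an alert request without sending it. The URL pattern and field names (message_type, entity_id, entity_display_name, state_message) follow our understanding of the VictorOps REST endpoint integration; treat them as assumptions and confirm them on the integration's settings page. The API key, routing key, and alert details are placeholders.

```python
import json
import urllib.request

# Assumed URL pattern of the generic REST endpoint; copy the real one from the Integrations page.
ALERT_URL = "https://alert.victorops.com/integrations/generic/20131114/alert/{api_key}/{routing_key}"


def build_alert_request(api_key, routing_key, message_type, entity_id, display_name, state_message):
    """Build a POST request that would open or update an incident."""
    payload = {
        "message_type": message_type,        # e.g. CRITICAL, WARNING, INFO, RECOVERY
        "entity_id": entity_id,              # stable ID so later alerts update the same incident
        "entity_display_name": display_name,
        "state_message": state_message,
    }
    return urllib.request.Request(
        ALERT_URL.format(api_key=api_key, routing_key=routing_key),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


alert = build_alert_request(
    "my-rest-key", "database", "CRITICAL",
    "db-01/disk", "Disk almost full on db-01", "Free space below 5%",
)
# To actually send it: urllib.request.urlopen(alert)
```

The routing key at the end of the URL is what ties the alert to the team and escalation policy created in step 3.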

5. Rules Engine

If necessary, configure the Rules Engine, which lets you define specific conditions that trigger custom actions. For example, it can add specific types of information to alerts to support organizational processes, or add data that incident responders need.
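Conceptually, a rule pairs a matching condition with fields to add or change on the alert. The Python sketch below models that idea only; the rule format, the wildcard matching, and the example runbook URL are hypothetical and do not reflect the Rules Engine's real syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: when a field matches a wildcard pattern, set extra fields on the alert.
RULES = [
    {
        "when": ("entity_id", "db-*"),
        "set": {
            "routing_key": "database",
            "runbook_url": "https://wiki.example.com/runbooks/db",
        },
    },
]


def apply_rules(alert: dict, rules: list = RULES) -> dict:
    """Return a copy of the alert enriched by every rule whose condition matches."""
    enriched = dict(alert)
    for rule in rules:
        field, pattern = rule["when"]
        # Conditions are checked against the original alert, not the enriched copy.
        if fnmatchcase(str(alert.get(field, "")), pattern):
            enriched.update(rule["set"])
    return enriched


print(apply_rules({"entity_id": "db-01/disk", "state_message": "Free space below 5%"}))
```

Here a database alert gains a runbook link for responders and a routing key, while alerts that match no rule pass through unchanged.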

What Happens after an Alert? Automating your Response with StackPulse

StackPulse automates and orchestrates your on-call response, by turning your team’s operations into code. StackPulse can both help reduce the inbound

The StackPulse platform can help you perform the following actions automatically:

  • Verifying the extent of the problem and its impact on components of the system—is it a mission-critical problem? How high would we prioritize handling it? Is this a false-positive or just a reflection of another issue that is already being handled?
  • Checking potential causes for the problem, either disqualifying them or identifying the direction for the root cause.
  • Updating all relevant persons and systems with the latest state of the potential incident.
  • Ensuring all investigation and remediation steps are audited and documented, in order to enable efficient post-incident analysis.
  • Automatically applying remediations and monitoring their effect on the system to achieve self-healing.

StackPulse ensures that engineering organizations get a high return on investment on monitoring infrastructure, by making every alert and notification count. 

Get early access to the StackPulse platform and try it for yourself!
