VictorOps Basics and a 5-Step Quick Start Guide
VictorOps is a tool you can use for involving engineering and operations leaders in handling production incidents. When something goes wrong with IT systems, VictorOps notifies the right people and ensures they are working to resolve the problem.
Key features of VictorOps include:
- Scheduling and escalation tools enabling streamlined notifications
- Customizable alert rules to reduce noise
- Centralized view for alert data
- Mobile notifications with acknowledge, resolve and reroute, via native iOS and Android apps
In this article, you will learn:
- What are VictorOps Rotations and Shifts?
- VictorOps Alert Behavior
- Routing Keys in VictorOps
- Routing Rules in VictorOps
- Rerouting an Incident
- VictorOps Notifications
- Notification Aggregation
- Getting Started with VictorOps in 5 Steps
- What Happens When VictorOps Alerts Come in? Automating Your Response
What are VictorOps Teams?
A team is a group of VictorOps users that is associated with on-call schedules and escalation policies.
What are VictorOps Rotations and Shifts?
After creating a team in VictorOps, admins can create an on-call schedule. A schedule has three parts: rotations, shifts, and escalation policies. Rotations are recurring schedules with one or more shifts. There are three types of shifts: partial day, 24/7, and multi-day.
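To make the relationship between rotations and shifts more concrete, here is a minimal sketch of a 24/7 rotation in which members hand off after a fixed number of days. The `Rotation` class, its fields, and the hand-off arithmetic are illustrative assumptions, not the VictorOps data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Rotation:
    """Hypothetical 24/7 rotation: members take turns, each for `shift_days` days."""
    members: list[str]
    start: datetime
    shift_days: int = 7

    def on_call(self, when: datetime) -> str:
        # Count the shifts completed since the rotation started, then wrap around.
        shifts_done = (when - self.start) // timedelta(days=self.shift_days)
        return self.members[shifts_done % len(self.members)]


rotation = Rotation(["alice", "bob", "carol"], start=datetime(2024, 1, 1))
print(rotation.on_call(datetime(2024, 1, 10)))  # second week of the rotation: bob
```

Partial-day and multi-day shifts would need start and end times per shift; this sketch only covers the simplest case.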
VictorOps Alert Behavior
VictorOps alerts help you manage incidents by ensuring that the right people are notified, and that they acknowledge and resolve each incident.
Routing Keys in VictorOps
Alert routing allows admins to assign specific alerts to specific groups. Routing keys are the mechanism for this: they let you direct specific alerts to specific team members, who can resolve the issue without the entire team being notified.
Routing Rules in VictorOps
When VictorOps generates an alert, the routing rules specify who should receive the alert. If the incident has a specific escalation policy, the policy determines who is initially notified, and to whom the incident should be escalated if it is not acknowledged.
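The interplay between routing keys and escalation policies can be sketched as a lookup table. The routing key names, the step delays, and the fallback behavior below are hypothetical; in VictorOps these are configured in the web UI rather than in code.

```python
# Hypothetical routing table: routing key -> escalation policy steps.
# Each step is (minutes the incident has gone unacknowledged, who to notify).
ROUTING_RULES = {
    "database": [(0, "db-primary"), (15, "db-secondary"), (30, "db-manager")],
    "frontend": [(0, "web-primary"), (20, "web-secondary")],
}
DEFAULT_KEY = "frontend"  # assumed fallback for unknown routing keys


def who_to_notify(routing_key: str, minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been paged so far for this incident."""
    steps = ROUTING_RULES.get(routing_key, ROUTING_RULES[DEFAULT_KEY])
    return [target for delay, target in steps if delay <= minutes_unacknowledged]


print(who_to_notify("database", 0))   # only the first escalation step
print(who_to_notify("database", 20))  # first and second steps
```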
Rerouting an Incident
Sometimes, a user receiving a notification may need to reroute the incident to other teams or specific individuals. VictorOps can route incidents to individuals, groups, or escalation policies.
- When an incident is delivered directly to a user, the user is notified according to their individual paging policy until they acknowledge the incident.
- When an incident is routed to an escalation policy, or multiple policies, the appropriate individuals are notified, and the incident is escalated, according to each policy.
- Make sure the team members have configured escalation policies correctly, ensuring that important incidents don’t end up in a collective email box, but are directly paged to the relevant staff.
VictorOps Notifications
VictorOps notifications are the mechanism by which staff get informed about important incidents. Incidents are typically created as a result of alerts generated by monitoring tools, which are integrated with VictorOps.
Push notifications are sent through the native mobile apps. The following items can be pushed to the app, and users can view each notification, forward it to others, or report it:
- On-call changes
- Control calls
SMS notifications have a maximum of 160 characters. They show an incident ID, incident display name, and if two-way SMS is supported, response codes. The user can text one of the reply codes to provide updates about the incident.
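As an illustration of the 160-character limit, the sketch below assembles an SMS body from the three pieces mentioned above and shortens the display name when the message would be too long. The message layout and the reply-code format are assumptions made for the example, not the actual VictorOps SMS format.

```python
SMS_LIMIT = 160  # maximum length stated above


def format_sms(incident_id: int, display_name: str, reply_codes: dict[str, str]) -> str:
    """Build an SMS body, truncating the display name to fit the limit."""
    # Hypothetical reply-code layout, e.g. "ack:1111 res:2222".
    codes = " ".join(f"{action}:{code}" for action, code in reply_codes.items())
    prefix = f"#{incident_id} "
    suffix = f" {codes}" if codes else ""
    # Space left for the name; assumes the ID and codes are short.
    room = SMS_LIMIT - len(prefix) - len(suffix)
    if len(display_name) > room:
        display_name = display_name[: max(room - 3, 0)].rstrip() + "..."
    return prefix + display_name + suffix


print(format_sms(42, "CPU high", {"ack": "1111"}))
```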
VictorOps can also use email for incident notifications, to report gaps in schedules, or to report if VictorOps itself is in maintenance mode.
Phone notifications read out the incident display name, and allow the user to acknowledge or resolve the alert by entering a response code.
Notification Aggregation
When many alerts fire at the same time, it is important to group their notifications, to avoid overwhelming on-call teams. To limit the number of notifications, VictorOps automatically aggregates notifications when it detects an “alert storm”. It is also possible to mute all alerts or specific alerts.
Aggregation begins after three different incidents are opened within one minute. From that point onwards, notifications are sent only once per minute, and show the total number of incidents of the same type that have been opened.
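The aggregation rule can be modeled with a small sliding-window counter. This is a toy sketch of the behavior described above, not VictorOps's implementation: the class name, the notification texts, and the choice to stay in aggregation mode once a storm is detected are all assumptions.

```python
from collections import deque
from typing import Optional

WINDOW_SECONDS = 60   # "within one minute"
STORM_THRESHOLD = 3   # "three different incidents"


class NotificationAggregator:
    """Toy model of alert-storm aggregation; timestamps are in seconds."""

    def __init__(self):
        self.recent = deque()     # open times of incidents inside the window
        self.pending = 0          # incidents not yet reported in a notification
        self.last_sent = None     # time of the last aggregated notification
        self.aggregating = False

    def incident_opened(self, now: float) -> Optional[str]:
        """Return the notification text to send, or None to stay quiet."""
        self.recent.append(now)
        while now - self.recent[0] > WINDOW_SECONDS:
            self.recent.popleft()

        if not self.aggregating:
            if len(self.recent) < STORM_THRESHOLD:
                return "incident opened"  # normal per-incident notification
            self.aggregating = True       # third incident within a minute

        self.pending += 1
        if self.last_sent is None or now - self.last_sent >= WINDOW_SECONDS:
            text = f"{self.pending} incident(s) opened"
            self.pending, self.last_sent = 0, now
            return text
        return None
```

A real system would also flush pending incidents on a timer rather than waiting for the next incident to arrive; the sketch omits that for brevity.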
Getting Started with VictorOps in 5 Steps
Here is a brief process you can use to start implementing VictorOps for your organization.
1. Add users
The first and most important step in configuring VictorOps is adding users. To add a new user:
- Select User > User invitation and add the user’s email
- Add users using the API by selecting Integration > API
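For the API route, a request might look roughly like the sketch below. The base URL, the `/user` path, the `X-VO-Api-Id` and `X-VO-Api-Key` headers, and the body field names reflect my understanding of the VictorOps public API and should be treated as assumptions; check them against the API documentation shown under Integration > API before relying on them. The code only builds the request and does not send it.

```python
import json
import urllib.request

API_BASE = "https://api.victorops.com/api-public/v1"  # assumed base URL


def build_add_user_request(api_id, api_key, username, email, first_name, last_name):
    """Build (but do not send) a request that creates a user."""
    body = {
        "firstName": first_name,   # field names are assumptions
        "lastName": last_name,
        "username": username,
        "email": email,
    }
    return urllib.request.Request(
        f"{API_BASE}/user",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-VO-Api-Id": api_id,
            "X-VO-Api-Key": api_key,
        },
        method="POST",
    )


req = build_add_user_request(
    "my-api-id", "my-api-key", "jdoe", "jdoe@example.com", "Jane", "Doe"
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```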
2. Create Teams
A team consists of a list of users, and manages on-call shifts and escalations. Create a team by selecting Teams > Add Team, and adding a team name.
3. Create Rotations, Escalation Policies, and Routing Keys
4. Integrations
Select Integrations to connect VictorOps to your existing infrastructure. Each integration can feed alerts into VictorOps, enabling it to create incidents and notify staff. See a full list of VictorOps integrations.
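As an example of what an integration sends, the sketch below builds an alert for a generic REST-style endpoint, with the routing key as the last URL segment. The URL shape and the field names (`message_type`, `entity_id`, `entity_display_name`, `state_message`) are my understanding of the REST endpoint integration and are assumptions here; the integration's settings page shows the exact URL for your account. The request is built but not sent.

```python
import json
import urllib.request

# Assumed URL shape for the generic REST endpoint integration.
ALERT_URL = (
    "https://alert.victorops.com/integrations/generic/20131114/alert/"
    "{integration_key}/{routing_key}"
)


def build_alert_request(integration_key, routing_key, entity_id, message,
                        message_type="CRITICAL"):
    """Build (but do not send) an alert that would open or update an incident."""
    payload = {
        "message_type": message_type,   # e.g. CRITICAL, WARNING, RECOVERY
        "entity_id": entity_id,         # identifies the incident across updates
        "entity_display_name": message,
        "state_message": message,
    }
    url = ALERT_URL.format(integration_key=integration_key, routing_key=routing_key)
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


alert_req = build_alert_request("my-integration-key", "database",
                                "db-07/disk", "Disk 95% full on db-07")
print(alert_req.full_url)
```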
5. Rules Engine
If necessary, configure the Rules Engine, which lets you define specific conditions that trigger custom actions. For example, a rule can add specific types of information to alerts to support organizational processes, or add data that incident responders need.
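Conceptually, a rule pairs a matching condition with a transformation of the alert. The sketch below mimics that idea with wildcard matching on one field; the rule structure, the field names, and the runbook URL are hypothetical, and the real Rules Engine is configured in the VictorOps UI rather than in Python.

```python
from fnmatch import fnmatchcase


def apply_rules(alert: dict, rules: list[dict]) -> dict:
    """Return a copy of the alert with fields added by every matching rule."""
    enriched = dict(alert)
    for rule in rules:
        value = str(alert.get(rule["field"], ""))
        if fnmatchcase(value, rule["pattern"]):  # wildcard match, case-sensitive
            enriched.update(rule["set"])
    return enriched


# Hypothetical rule: annotate database alerts with a runbook link.
RULES = [
    {
        "field": "entity_id",
        "pattern": "db-*",
        "set": {"runbook": "https://wiki.example.com/db-runbook"},
    },
]

enriched_alert = apply_rules(
    {"entity_id": "db-07/disk", "state_message": "Disk 95% full"}, RULES
)
print(enriched_alert["runbook"])
```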
What Happens after an Alert? Automating your Response with StackPulse
StackPulse automates and orchestrates your on-call response, by turning your team’s operations into code. StackPulse can both help reduce the inbound
The StackPulse platform can help you perform the following actions automatically:
- Verifying the extent of the problem and its impact on components of the system—is it a mission-critical problem? How high would we prioritize handling it? Is this a false-positive or just a reflection of another issue that is already being handled?
- Checking potential causes for the problem, either disqualifying them or identifying the direction for the root cause.
- Updating all relevant persons and systems with the latest state of the potential incident.
- Ensuring all investigation and remediation steps are audited and documented, in order to enable efficient post-incident analysis.
- Automatically applying remediations and monitoring their effect on the system to achieve self-healing.
StackPulse ensures that engineering organizations get a high return on investment on monitoring infrastructure, by making every alert and notification count.
Get early access to the StackPulse platform and try it for yourself!