Automated Kubernetes Operations and Troubleshooting
Delivering production applications with Kubernetes? Give your SRE, DevOps and Software Engineering teams a convenient and automated way to operate and troubleshoot their components. Provide easy visibility into production state and configuration changes, and automate typical processes such as understanding failure reasons, scaling up or down, and rolling back.
How StackPulse Helps
Operate production-grade applications in Kubernetes without having to train your team to become Kubernetes experts.
Kubernetes is a powerful but very complex system. Most production Kubernetes clusters today are based on managed offerings (such as EKS, GKE and AKS) because they make building and using Kubernetes clusters easier. But when it comes to operating applications deployed on these clusters, troubleshooting them, and understanding the reasons behind a failure, the majority of organizations still rely on dedicated skill sets within their SRE, DevOps and Software Engineering teams. This isn’t a scalable approach.
The StackPulse Solution
Inspired by operational practices from the leading SaaS vendors, StackPulse gives you a set of tools that make operating and troubleshooting your applications on Kubernetes easy and safe. Use the interface of your choice, such as Slack, to have information about failure reasons, production changes and more collected and delivered to you. Trigger typical mitigation scenarios, such as scaling up or down and rolling back, in a safe and controlled environment.
StackPulse gives organizations running Kubernetes a powerful set of capabilities to augment their existing incident response practices.
Ready-made Kubernetes Operators
Based on the collective experience of Kubernetes operators, these ready-made patterns make life much easier for SREs, DevOps and Software Engineers by helping them handle typical operation and troubleshooting scenarios safely and efficiently. Below are some examples that can be used out of the box or modified to suit your individual needs:
- Automatically retrieve and analyze a heap dump from a crashed Pod
- Retrieve and analyze logs from a failed CronJob
- Safely roll back deployments of specific services
- Safely modify parameters for Horizontal Pod Autoscalers (HPAs)
- Retrieve logs from a restarted container and compare them to the logs of a crashed one
- Automatically identify and surface misconfigurations and other problems on Kubernetes clusters
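To make "safely modify parameters" for an HPA more concrete, here is a minimal Python sketch of the kind of guardrail check a playbook might run before applying a change. It is not StackPulse's implementation: the `HpaBounds` type, the `validate_hpa_change` function and the two policy constants are all hypothetical, and the rules themselves are example policy rather than Kubernetes requirements.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HpaBounds:
    """Replica bounds of an HPA (its minReplicas and maxReplicas fields)."""
    min_replicas: int
    max_replicas: int


# Hypothetical guardrails; a real playbook would load these from team policy.
HARD_CAP = 50          # never allow maxReplicas above this value
MAX_STEP_FACTOR = 2.0  # never grow maxReplicas by more than this factor at once


def validate_hpa_change(current: HpaBounds, proposed: HpaBounds) -> List[str]:
    """Return the guardrail violations; an empty list means the change may proceed."""
    problems = []
    if proposed.min_replicas < 1:
        problems.append("minReplicas must be at least 1 (example policy)")
    if proposed.max_replicas < proposed.min_replicas:
        problems.append("maxReplicas must not be lower than minReplicas")
    if proposed.max_replicas > HARD_CAP:
        problems.append(f"maxReplicas exceeds the hard cap of {HARD_CAP}")
    if proposed.max_replicas > current.max_replicas * MAX_STEP_FACTOR:
        problems.append(
            f"maxReplicas grows by more than {MAX_STEP_FACTOR}x in one step"
        )
    return problems
```

A caller would apply the change only when the returned list is empty, and otherwise surface the violations to the person who requested it, for example in a Slack thread.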
Visibility into Production Configuration State
StackPulse integrates with CI/CD, Progressive Deployment Operators, Kubernetes Audit Logs, and Monitoring Systems to provide a single pane of glass into all changes and events in the cluster, helping triage and troubleshoot problems. With StackPulse, you can:
- View all configuration change events
- Track and audit progressive deployments and rollbacks
- Follow the changes in modules that are being deployed
- Relate production alerts to configuration changes
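Relating an alert to configuration changes usually comes down to a time-window lookup. The sketch below illustrates that idea in Python under stated assumptions: the `ChangeEvent` and `Alert` types, the `changes_preceding` function and the 30-minute default window are hypothetical and do not describe the StackPulse data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ChangeEvent:
    service: str
    description: str
    at: datetime


@dataclass(frozen=True)
class Alert:
    service: str
    name: str
    at: datetime


def changes_preceding(
    alert: Alert,
    changes: List[ChangeEvent],
    window: timedelta = timedelta(minutes=30),  # assumed default, tune per team
) -> List[ChangeEvent]:
    """Changes to the alerting service made within `window` before the alert, newest first."""
    candidates = [
        c for c in changes
        if c.service == alert.service
        and timedelta(0) <= alert.at - c.at <= window
    ]
    return sorted(candidates, key=lambda c: c.at, reverse=True)
```

Listing the newest change first puts the most likely culprit at the top during triage. Changes to other services and changes made after the alert fired are ignored.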
Demo Videos: See Playbooks in Use
Need a little inspiration to get started? Check out these videos that showcase exactly how to use our playbooks to automate Kubernetes operations and reliability.
K8s Heap Dump: Video Demo
Watch a video on how to easily use the Heap Dump Playbook.
K8s Roll Back: Video Demo
Get a step-by-step demonstration of how to automate a rollback using a playbook.
K8s Service Scale: Video Demo
Watch a step-by-step demo of an automated K8s service scale up.
Closing the DevOps Infinity Loop
In this eBook you’ll learn the benefits of closing the DevOps infinity loop and achieving integration between reliability on one hand and application design and development on the other.