Automated Kubernetes Pod Restarting Analysis with StackPulse
Customer-facing incidents and outages often get the majority of attention, but the low severity alerts that get raised in masses every day can be just as important when working to improve overall reliability.
Just because an alert has been flagged as low severity, and doesn’t have an immediate impact on the end-users, doesn’t mean the underlying cause can be safely ignored. It could be the first sign of growing service instability, or a key factor in a critical incident yet to come.
But trying to analyze and triage every low severity alert comes with downsides — and is, sometimes, simply impossible. Increased toil and alert fatigue, decreased focus on the major incidents, and reduced time for innovation and overall reliability are just a few of the issues that can arise from time spent overanalyzing low severity alerts.
So every team is left with a risk-balancing exercise: How do you effectively focus on the most important things (critical incidents, innovation), while ensuring the small details (low severity alerts) don’t end up causing you even more pain?
At StackPulse, we’ve struck this balance with our reliability as code approach by automating the manual analysis steps for low severity alerts. This saves time our on-call developers may have spent on toil, helps us identify issues before they become incidents, and reduces the amount of time we spend fighting fires or buried in logs for low severity alerts.
An example of this is our code-based playbook: Kubernetes Pod Restarting Analysis.
Pod Restarting Analysis
If you use Kubernetes like we do, you may be seeing periodic pod restarting alerts. You’re probably not waking up in the middle of the night each time an alert fires for a restarting pod, but you’re also not just ignoring these, as they could be the symptom of a deeper issue. Of course, it could be just a hiccup that your orchestrator fixes with no further action required. But determining which of the two you’re facing is the tricky part.
The troubleshooting itself is fairly straightforward. There are a few options, but in general the process looks something like this:
- Run `kubectl get pods` to take a look at all the running containers on the relevant Kubernetes nodes, checking whether there is a dependency.
- Get log files from the relevant container with `kubectl logs podname`, for both the current and previous instances.
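The steps above can be sketched as a small script. The pod name is a placeholder to substitute from the alert, and the script only runs the commands when `kubectl` is installed and pointed at a reachable cluster:

```shell
#!/bin/sh
# Placeholder pod name; substitute the pod from the restart alert.
POD="${POD:-example-pod}"

if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  # 1. List pods with node placement, to spot co-located dependencies.
  kubectl get pods -o wide
  # 2. Logs from the current instance of the restarting pod.
  kubectl logs "$POD"
  # 3. Logs from the instance that ran before the restart.
  kubectl logs "$POD" --previous
else
  echo "kubectl not found or no cluster reachable; run against a configured cluster"
fi
```

The `--previous` flag is what makes the second log fetch useful: it returns the output of the container instance that crashed, which usually holds the actual failure reason.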
Those steps are straightforward to run once, but repeating them every time a pod restart occurs is not something anyone should be expected to do manually. There are also additional considerations:
- First, the on-call engineer needs to know which commands to run, and have a working knowledge of Kubernetes to understand the container-to-pod relationship.
- Second, the on-call developer has to have production access to run `kubectl`, or be able to escalate to someone who does.
- Third, even if the first two go smoothly, there are still a few minutes’ worth of work every time a low severity pod restarting alert fires. That adds up quickly, consuming a lot of developer time.
What we’ve done at StackPulse is eat our own dog food. We don’t expect every developer on the team to learn the ins and outs of running Kubernetes, nor do developers have direct access to production, yet every member of our team rotates through on call.
We’ve built a playbook that takes the above process and fully automates it, providing the on-call developer with the logs needed to identify the root cause of a consistent pod restarting alert and triage it appropriately. We don’t spend valuable time sorting out production access, then more time consulting a runbook in a wiki, and eventually even more time executing commands. Instead, the on-call developer is delivered the logs automatically via a code-based playbook that runs in StackPulse.
This playbook automatically triggers whenever we get a Kubernetes pod restarting alert. It fetches the running containers, then collects logs from the current and previous instances of those containers, delivering them to our on-call developer via Slack.
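The first step of that automation, identifying which containers in the pod actually restarted, can be sketched in a few lines. This is a minimal illustration, not StackPulse’s actual playbook code: it assumes the playbook receives the pod’s status as JSON (the field names follow the Kubernetes Pod API; the sample payload and function name are hypothetical):

```python
import json

def restarting_containers(pod_status_json: str) -> list:
    """Given `kubectl get pod <name> -o json` output, return the
    containers that have restarted, with their restart counts."""
    pod = json.loads(pod_status_json)
    results = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        if cs.get("restartCount", 0) > 0:
            results.append({"container": cs["name"],
                            "restarts": cs["restartCount"]})
    return results

# Illustrative pod status, trimmed to the fields used above.
SAMPLE = json.dumps({
    "status": {"containerStatuses": [
        {"name": "api", "restartCount": 3},
        {"name": "sidecar", "restartCount": 0},
    ]}
})

print(restarting_containers(SAMPLE))
# → [{'container': 'api', 'restarts': 3}]
```

A real playbook would then fetch logs for each returned container (current and previous instances) and post them to the alert channel.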
Each time this runs, it saves 10 to 30 minutes of developer time while still giving us insight into the minor issues that might lead to incidents later on, so we can proactively fix them. This helps our teams build a more reliable service for our customers, while staying free from toil and endless manual work for common alerts.
The Kubernetes Pod Restarting Analysis playbook is only one of many out-of-the-box playbooks you can find in our playbook repository.
- Check out more pre-built playbooks to help you save time and deliver more reliable services.
- Learn more about the benefits of code-based, executable playbooks for incident response.
- Get started with the free edition of the StackPulse Reliability Platform.