The SRE’s Guide to Kubernetes Troubleshooting

If you don’t run Kubernetes reliably, Kubernetes can’t run your applications reliably. That’s where Site Reliability Engineers (SREs) come into the Kubernetes picture. As experts in reliability engineering, SREs must assume responsibility for the reliability of Kubernetes itself and, in turn, for laying the foundation for the reliability of the entire microservices-based software stack.

The SRE’s Guide to Kubernetes Troubleshooting highlights common principles for running Kubernetes reliably. Applied consistently, they help you not only avoid many common errors but also respond to alerts and incidents faster. This guide contains:

  1. Tips on how to configure a more resilient implementation of Kubernetes
  2. Tools to easily troubleshoot and fix common issues with Kubernetes environments
  3. Examples of playbooks that you can run to remediate complex Kubernetes errors



