Top 10 SRE Books to Read in 2021

Site Reliability Engineering (SRE) is a rapidly growing discipline. While SRE originated at tech giants like Google and Facebook, its principles are now practiced at companies of all sizes. Still, SRE is an emerging discipline, and there are many different philosophies on how to implement it successfully.
In this post, I share the 10 SRE books that helped me understand SRE's core tenets, its practical implementation, and the relationship between SRE and other methodologies like DevOps. Hopefully, this list will help those of you who are beginning (or are in the middle of) your journey to better understand the core principles of SRE.
Essential books for SRE:
- Site Reliability Engineering (by Google)
- The Site Reliability Workbook: Practical Ways to Implement SRE
- Real-World SRE
- Seeking SRE: Conversations About Running Production Systems at Scale
- Practical Reliability Engineering
- The Field Guide to Understanding ‘Human Error’
- The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services
- Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation
- Platform Revolution
- The Systems Bible: The Beginner’s Guide to Systems Large and Small
Site Reliability Engineering (by Google)
Authors: Betsy Beyer, Chris Jones, Jennifer Petoff & Niall R. Murphy
How to buy: Google
This book is the central reference for the SRE field. It brings together the principles, practices, and examples that Google's teams use to improve scalability, stability, and efficiency. Each chapter presents a primary function of the SRE role and how that function improves reliability. The book is less a practical guide and more an overview of SRE culture and the importance of adopting the SRE discipline.
The Site Reliability Workbook: Practical Ways to Implement SRE
Authors: Betsy Beyer, Niall R. Murphy, David K. Rensin, Kent Kawahara & Stephen Thorne
How to buy: Amazon
The long-awaited sequel to Google's Site Reliability Engineering (2016) extends the first book with hands-on advice and real-world examples of SRE. The authors aimed to add implementation details to the principles outlined in the first book and, more importantly, to change the notion that SRE is only relevant to Google itself or to Google-scale organizations.
Real-World SRE
Author: Nat Welch
How to buy: Amazon
Real-World SRE provides a step-by-step guide to dealing with system failures. It describes tools, strategies, and practices that help you build a more active SRE culture, so you can overhaul your existing incident response practices and recover from failures faster.
What you can learn from the book:
- How to monitor the likelihood of complete system failure
- How to alert your team to outages
- Understanding incident response strategies
- Leveraging test automation and software development skills for SRE
- Overcoming bottlenecks that affect the user experience
- Succeeding in SRE interviews
Seeking SRE: Conversations About Running Production Systems at Scale
Author: David N. Blank-Edelman
How to buy: Amazon
Seeking SRE is a collection of essays by SRE practitioners about running production systems at scale.
Here's what you'll learn from this book:
- Ways of applying SRE and SRE principles in different settings
- Relationship between SRE and other methods (for example, DevOps)
- Technology that can enable more straightforward implementation of SRE
- New practices that will be popular in SRE soon
- The human side of SRE practices