Top 10 SRE Books to Read in 2021

Site Reliability Engineering (SRE) is a rapidly growing discipline. While SRE originated at tech giants like Google and Facebook, its principles are now practiced at companies of all sizes. Still, the field is young, and there are many different philosophies on how to implement it successfully.

In this post, I share a list of the top 10 SRE books that helped me understand SRE's core tenets, its practical implementation, and the relationship between SRE and other methodologies like DevOps. Hopefully, this list will help those of you who are beginning (or in the middle of) your journey to better understand the core principles of SRE.

Essential books for SRE:

  1. Site Reliability Engineering (by Google)
  2. The Site Reliability Workbook: Practical Ways to Implement SRE
  3. Real-World SRE
  4. Seeking SRE: Conversations About Running Production Systems at Scale
  5. Practical Reliability Engineering
  6. The Field Guide to Understanding ‘Human Error’
  7. The Practice of Cloud System Administration: DevOps and SRE Practiced for Web Services
  8. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation
  9. Platform Revolution
  10. The Systems Bible: The Beginner’s Guide to Systems Large and Small

Site Reliability Engineering (by Google)

Author: Betsy Beyer, Chris Jones, Jennifer Petoff & Niall R. Murphy

How to buy: Google

This book is the central reference for the SRE field. It brings together principles, practices, and examples Google’s teams use to improve scalability, stability, and efficiency.

Each chapter presents a primary function of the SRE role and how that function improves reliability. This book is less a practical guide than an overview of SRE culture and the importance of adopting the SRE discipline.

The Site Reliability Workbook: Practical Ways to Implement SRE

Author: Betsy Beyer, Niall R. Murphy, David K. Rensin, Kent Kawahara & Stephen Thorne

How to buy: Amazon

The long-awaited sequel to Google’s Site Reliability Engineering (2016) extends the first book with hands-on advice and real-world examples of SRE. The authors aimed to add implementation details to the principles outlined in the first book and, more importantly, to change the notion that SRE is only relevant for Google itself or Google-scale organizations.

Real-World SRE

Author: Nat Welch 

How to buy: Amazon

Real-World SRE provides a step-by-step guide to dealing with system failures. It defines tools, strategies, and practices to help implement a more active SRE culture. It can help you overhaul your existing incident response practices and recover from failure faster.

What you can learn from the book:

  • How to monitor the likelihood of complete system failure
  • How to alert your team to outages
  • Understanding incident response strategies
  • Leveraging test automation and software development skills for SRE
  • Overcoming bottlenecks that affect the user experience
  • Succeeding in SRE interviews

Seeking SRE: Conversations About Running Production Systems at Scale

Author: David N. Blank-Edelman

How to buy: Amazon

Seeking SRE is a collection of essays on running production systems at scale, contributed by practitioners from a range of organizations rather than Google alone.

Here’s what you’ll learn from this book:

  • Ways of applying SRE and SRE principles in different settings
  • Relationship between SRE and other methods (for example, DevOps)
  • Technology that can enable more straightforward implementation of SRE
  • Emerging practices that are likely to gain traction in SRE
  • The human side of SRE practices

Practical Reliability Engineering

Author: Patrick D. T. O’Connor & Andre Kleyner

How to buy: Amazon

Practical Reliability Engineering introduces advanced theoretical concepts, practical applications, and industry best practices around reliability. It takes a holistic approach that will be useful for many different roles, but specifically for SREs, the book provides in-depth chapters about availability, software reliability, reliability data analysis, maintenance and maintainability.

The Field Guide to Understanding ‘Human Error’

Author: Sidney Dekker

How to buy: Amazon

Accepting failure is a core belief in the SRE mindset, but it can be challenging to explain the concept and win over business leaders. Now in its third edition, this book aims to change how organizations think about “human error” problems, including accidents, safety systems, and postmortem procedures.

The Practice of Cloud System Administration: DevOps and SRE Practiced for Web Services

Author: Thomas A. Limoncelli, Strata R. Chalup & Christina J. Hogan

How to buy: Amazon

With the move to the cloud, organizations must embrace DevOps and SRE practices to move their IT departments forward. This book provides case studies of state-of-the-art production systems (for example, at Netflix and Amazon) and explains why distributed systems call for fundamentally different management principles, which cloud service providers may not be able to supply for you.

Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation

Author: Jez Humble & David Farley

How to buy: Amazon

In this classic work, Humble and Farley outline strategic principles and practices for achieving continuous software delivery. For developers not entirely comfortable with DevOps, this is a good primer on how DevOps principles affect overall reliability.

Platform Revolution

Author: Geoffrey G. Parker, Marshall W. Van Alstyne & Sangeet Paul Choudary

How to buy: Amazon

Cloud paradigms like SaaS, IaaS, and XaaS are disrupting the technology space, but the mechanisms and behaviors behind them are less widely understood. This book explores the platform business model, studying its historical background, operational strategy, and economic significance. This is a good foundation for IT professionals looking to understand the challenges site reliability engineering was created to solve.

The Systems Bible: The Beginner’s Guide to Systems Large and Small

Author: John Gall

How to buy: Amazon

This systems engineering book helps the reader achieve a structured understanding of system errors, arguing that errors are a basic function of any system. For SREs, the book provides 40 chapters that discuss the benefits of assuming that systems will fail. The book also explains how to measure, optimize, and manage systems at any scale, all essential practices for implementing SRE.

Want to skip reading these SRE books and jump right into a demo with StackPulse? Contact us here.
