Managing Reliability for Monoliths vs. Microservices: The Challenges for SREs
What challenges does an SRE face today? The answer depends, in part, on which types of applications and environments the SRE must support. Managing a monolith presents different challenges than dealing with a microservices application.
That’s not to say that monoliths are necessarily easier to support. It is worth understanding, though, how the challenges of each architecture differ and how they affect the work of SREs.
Monolith Challenges for SREs
The idea that monoliths are simpler to manage (if they’re well-designed, at least) is pervasive. If you’ve never thought in depth about the topic, then you may assume that SREs who support monoliths have it relatively easy.
That’s far from the case. While it’s true that monoliths are architecturally simpler than microservices apps, monoliths present their own special challenges from a reliability engineering standpoint:
- Single points of failure: In a monolith, a bug in any part of the application can potentially bring down the whole application.
- Slower redeploys: When something goes wrong with your monolith, you need to redeploy the whole thing to apply a fix. The redeployment usually takes much longer than redeploying a single microservice.
- Harder to pinpoint problems: Transactions in a monolith are easier to trace in the sense that they don’t flow through many different services, as they would in a microservices app. On the other hand, without distinct service boundaries for a request to cross, it can be harder to pinpoint the root cause of a problem in a monolith. You can’t say, “everything was fine until the trace got to service X, so that’s probably the root of the issue.”
- More bloat: Arguably, it’s easier to end up with bloat – meaning unnecessary code or features – in a monolith, because the large scale of the application makes it harder for developers to keep things lean and mean. Bloat increases the chance that something will go wrong. Even code that the app no longer uses could trigger a memory leak, for instance, and cause problems for the functional parts of the application.
These challenges translate into a few key considerations for SREs who manage monoliths. One is that minimizing mean time to detect (MTTD) and mean time to resolve (MTTR) is perhaps even more important than it is when working with microservices. When a single failure could bring down your entire app, and it will take a long time to redeploy a fix, you need to find the problem and start the resolution process ASAP.
Another is that monitoring (meaning tracking surface-level application metrics and logs) assumes outsize importance when managing monoliths. Given that SREs can’t do deep, service-by-service traces in monoliths, they need to lean extra heavily on monitoring tools.
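To make the MTTD and MTTR metrics mentioned above more concrete, here is a minimal Python sketch of how they might be computed from incident records. The `Incident` class and its field names are hypothetical and not taken from any particular incident-management tool. The sketch also assumes MTTR is measured from the start of the failure to its resolution; some teams measure it from the moment of detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative only.
    started: datetime   # when the failure actually began
    detected: datetime  # when monitoring (or a person) noticed it
    resolved: datetime  # when service was restored


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Average a list of durations; returns zero for an empty list."""
    if not durations:
        return timedelta(0)
    return sum(durations, timedelta(0)) / len(durations)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: average gap between failure start and detection."""
    return mean_duration([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve, assumed here to run from failure start to resolution."""
    return mean_duration([i.resolved - i.started for i in incidents])


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    history = [
        Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=60)),
        Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=100)),
    ]
    print(mttd(history))  # 0:15:00
    print(mttr(history))  # 1:20:00
```

For a monolith, where each incident can take the whole app down, tracking these two numbers over time is one simple way to see whether detection and redeploy speed are improving.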
Microservices Challenges for SREs
Life doesn’t get any easier if you’re an SRE managing a microservices environment. You just face different challenges:
- Complex traces: Perhaps most obvious is that, because microservices apps consist of a number of individual services, tracing a transaction across them is complicated. You need to monitor each service while also understanding how services interact.
- Backend-frontend mappings: In a microservices app, it’s not always clear how a transaction that originates on the frontend impacts the backend, and vice versa.
- More services, more problems: The more microservices you have, the greater the number of places where something can go wrong – and, by extension, the higher the number of alerts you are likely to be contending with.
- Complicated redeploys: It may be faster to redeploy an individual microservice than an entire monolithic app, but the tradeoff is that with microservices, it can be tricky to figure out which services need to be redeployed to fix an issue. You don’t want to redeploy more services than necessary, but you also don’t want to miss any service that has an issue.
The main takeaway here for SREs is that mere monitoring doesn’t suffice when dealing with microservices. SREs in this context need so-called observability, which goes deeper by tracing how transactions flow within a microservices environment and linking performance or reliability issues to the individual microservices that cause them.
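As a rough illustration of linking an issue to the individual microservice that causes it, the Python sketch below walks a simplified trace and reports the services whose spans failed without any failing child span. The `Span` structure and the service names are hypothetical. Real tracing systems record far more detail, and a failing leaf span is only a starting point for investigation, not proof of root cause.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span; real tracing data carries much more.
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    error: bool


def likely_culprits(spans: list[Span]) -> list[str]:
    """Return services with a failed span that has no failed child span.

    Errors tend to propagate upward to callers, so the deepest failing
    spans are the most useful place to start looking.
    """
    # IDs of spans that have at least one failing child.
    parents_of_failures = {
        s.parent_id for s in spans if s.error and s.parent_id is not None
    }
    return sorted(
        {
            s.service
            for s in spans
            if s.error and s.span_id not in parents_of_failures
        }
    )


if __name__ == "__main__":
    trace = [
        Span("a", None, "frontend", error=True),
        Span("b", "a", "checkout", error=True),
        Span("c", "b", "payments", error=True),
        Span("d", "b", "inventory", error=False),
    ]
    print(likely_culprits(trace))  # ['payments']
```

In this made-up trace, the frontend and checkout spans fail only because the payments call beneath them failed, so payments is the service to examine first.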
Also important for SREs handling microservices apps is the ability to divide and conquer. With so many different services to manage, it’s critical to be able to identify the most serious issues and delegate responsibility for resolving them to different team members.
You might argue, too, that SREs who support microservices applications need to be able to collaborate even more closely with the rest of the development team than do their monolith-supporting counterparts. It’s only by working with product developers that they’ll know which microservices are the most important. Some parts of the app might be less critical from a business perspective than others, and SREs need to understand the differences so they know what to prioritize.
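One way to picture this kind of prioritization is a small triage routine that orders open alerts by how critical each service is to the business, then by alert severity. The sketch below is only an illustration: the `Alert` class, the 1-to-5 severity scale, the service names, and the criticality scores are all assumptions. In practice the scores would come from conversations with product developers.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert record; fields are illustrative only.
    service: str
    severity: int  # assumed scale: 1 (low) to 5 (high)


# Assumed business-criticality scores, agreed with the product team.
CRITICALITY = {"checkout": 3, "search": 2, "recommendations": 1}


def triage(alerts: list[Alert], criticality: dict[str, int]) -> list[Alert]:
    """Order alerts by service criticality first, then by severity.

    Services missing from the criticality map get a score of 0 and
    therefore sort last.
    """
    return sorted(
        alerts,
        key=lambda a: (criticality.get(a.service, 0), a.severity),
        reverse=True,
    )


if __name__ == "__main__":
    open_alerts = [
        Alert("recommendations", 5),
        Alert("checkout", 2),
        Alert("search", 4),
        Alert("checkout", 4),
    ]
    for alert in triage(open_alerts, CRITICALITY):
        print(alert.service, alert.severity)
```

With these example scores, both checkout alerts are handled before the search alert, and the high-severity recommendations alert comes last because that service was rated least critical.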
In short, monolithic and microservices apps both present unique challenges for SREs. By extension, the SREs who are responsible for these apps need to operate in different ways. We’ve touched on those differences a bit here, but we will take a deeper dive into SRE best practices for working with microservices as compared to monoliths in a follow-up post.