The Top 4 Multicloud Reliability Challenges for SREs

Plenty of people will tell you that multicloud is the way to go, and with good reason: a multicloud architecture has many benefits.

Yet if you’re an SRE, deploying workloads on multiple clouds isn’t all fun all the time. Although multicloud gives SREs more opportunities to increase reliability (by, for example, spreading workloads across multiple clouds), it also creates special challenges that SREs must address to keep a multicloud architecture reliable.

This article discusses the top four challenges that SREs face when working with multicloud environments. A subsequent article will detail steps that they can take to overcome them.

Challenge 1: Multicloud Availability Hinges on Integration

One of the reasons why people get excited about multicloud is the idea that it can help achieve high availability.

That’s true. When you use multiple clouds, it becomes a lot easier to keep workloads available in the event that one of your clouds fails.

But what the conversation about availability often overlooks is that each cloud needs to be well integrated with the others in order to achieve high availability. Simply using more than one cloud at once does not magically ensure that your application never goes down. You need to engineer reliability into the multicloud architecture by integrating your clouds in such a way that if a service fails on one, an alternative will spin up in a different cloud.
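To make the failover idea concrete, here is a minimal sketch in Python. It assumes each cloud runs a replica of the service and that you can supply a health probe per replica. The `Replica` type, the `pick_active` function, and the cloud labels are hypothetical names for illustration, not part of any provider's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Replica:
    cloud: str                      # hypothetical label, e.g. "cloud-a"
    is_healthy: Callable[[], bool]  # health probe supplied by the caller


def pick_active(replicas: Sequence[Replica]) -> Optional[Replica]:
    """Return the first replica whose probe passes, in priority order."""
    for replica in replicas:
        try:
            if replica.is_healthy():
                return replica
        except Exception:
            # A probe that raises is treated the same as an unhealthy replica.
            continue
    return None


# Example: the primary cloud is down, so traffic should move to the secondary.
replicas = [
    Replica("cloud-a", lambda: False),
    Replica("cloud-b", lambda: True),
]
active = pick_active(replicas)
print(active.cloud if active else "no healthy replica")
```

In a real deployment the probe would be an HTTP or TCP check, and the result would drive DNS or load balancer changes. Those pieces are where most of the integration work lives.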

You also need to plan for disaster recovery: if one cloud fails, how will you recover its workloads to a different one?

Achieving high data availability is complicated, too, especially because multicloud data redundancy needs to be balanced with cost. You typically can’t simply mirror all of your data across all of your clouds in hot storage tiers, unless storage costs are no object. Instead, you need to think strategically about which data you place where in order to keep critical data available without breaking the bank.
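The cost tradeoff can be sketched with some simple arithmetic. The Python example below compares mirroring everything in hot storage on every cloud against a tiered placement in which only critical data is mirrored hot and everything else gets one hot copy plus one cold copy. The prices, dataset names, and sizes are made-up assumptions for illustration; real prices vary by provider, region, and tier.

```python
from typing import List, NamedTuple

# Assumed prices per GB per month; these are not real provider quotes.
HOT_PRICE_PER_GB = 0.023
COLD_PRICE_PER_GB = 0.004


class Dataset(NamedTuple):
    name: str
    size_gb: float
    critical: bool


def mirror_everything_hot(datasets: List[Dataset], clouds: int) -> float:
    """Monthly cost of keeping a hot copy of every dataset on every cloud."""
    total_gb = sum(d.size_gb for d in datasets)
    return total_gb * clouds * HOT_PRICE_PER_GB


def tiered_placement(datasets: List[Dataset], clouds: int) -> float:
    """Monthly cost when only critical data is mirrored hot on every cloud.

    Non-critical data gets one hot copy and one cold copy on another cloud.
    """
    cost = 0.0
    for d in datasets:
        if d.critical:
            cost += d.size_gb * clouds * HOT_PRICE_PER_GB
        else:
            cost += d.size_gb * (HOT_PRICE_PER_GB + COLD_PRICE_PER_GB)
    return cost


# Hypothetical datasets for a three-cloud setup.
DATASETS = [
    Dataset("orders", 500, critical=True),
    Dataset("logs", 4000, critical=False),
    Dataset("media", 1500, critical=False),
]

print(f"mirror all hot: {mirror_everything_hot(DATASETS, 3):.2f}")
print(f"tiered:         {tiered_placement(DATASETS, 3):.2f}")
```

The sketch ignores egress, request, and retrieval fees, which can matter as much as storage prices when data moves between clouds.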

In short, while multicloud has the potential to unlock high availability, it requires SREs to solve a variety of challenges in order to leverage that potential.

Challenge 2: Orchestration Complexities

Many teams turn to orchestration tools like Kubernetes to help manage multicloud environments. That makes sense; by abstracting workloads from the underlying infrastructure, Kubernetes makes it a lot easier to manage them centrally even if they are spread across multiple clouds.

The challenge for SREs, however, is that the orchestration layer adds its own set of complexities. Instead of dealing only with cloud services and applications, SREs in a Kubernetes-based multicloud environment have to contend with the myriad moving parts within Kubernetes – not to mention auxiliary tools like container registries.

You could think of it this way: organizations use orchestration to simplify multicloud application deployment and management for developers and the IT team. But they leave it to SREs to pick up the slack by engineering the orchestration layer for reliability.

Challenge 3: More Clouds, More Tests

Although all major clouds offer broadly similar core services, their implementations differ in ways that make it impossible in most cases to write a single set of tests that will cover all workloads on all cloud environments. AWS Lambda does serverless a bit differently than Azure Functions. IAM rules look and work differently on different clouds. Networking configurations vary. And so on.

For SREs, this means that there are more tests to write and run in order to achieve coverage across all cloud providers. This challenge is not insurmountable, but it makes the ability to streamline testing routines paramount.
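One way to streamline, sketched below in Python, is to share the checks that can be shared and isolate provider differences behind thin adapters. This is only an illustration under stated assumptions: `ObjectStore`, `FakeStoreA`, and `FakeStoreB` are hypothetical in-memory stand-ins, not real cloud SDK clients, and provider-specific behavior (IAM, networking, serverless semantics) would still need its own tests.

```python
from typing import Callable, Dict, Protocol


class ObjectStore(Protocol):
    """Minimal behavior every provider adapter is expected to offer."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class FakeStoreA:
    """Stand-in for one provider's storage client (hypothetical)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class FakeStoreB:
    """Stand-in for a second provider that namespaces keys differently."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects["container/" + key] = data

    def get(self, key: str) -> bytes:
        return self._objects["container/" + key]


def check_round_trip(store: ObjectStore) -> bool:
    """Shared contract check: what was written can be read back unchanged."""
    store.put("health/probe", b"ok")
    return store.get("health/probe") == b"ok"


ADAPTERS: Dict[str, Callable[[], ObjectStore]] = {
    "cloud-a": FakeStoreA,
    "cloud-b": FakeStoreB,
}


def run_contract_checks() -> Dict[str, bool]:
    """Run the same check against every registered adapter."""
    return {name: check_round_trip(make()) for name, make in ADAPTERS.items()}


print(run_contract_checks())
```

Adding a cloud then means writing one adapter and registering it, while the shared checks stay unchanged.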

Challenge 4: Limited Cloud Visibility

Even in a single-cloud environment, your ability to see what is happening deep inside applications is limited. You may not have access to server logs. You can collect only the metrics that your cloud provider’s APIs expose. The same is true if you’re dealing with SaaS applications.

These visibility limitations become even greater when you move to multiple clouds. Not only is the data you can collect from each cloud limited, but it is also difficult to correlate and compare that data between clouds because each cloud exposes somewhat different metrics and logs.
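A common response is to normalize each provider's metrics into one shared schema before comparing them. The Python sketch below does this for CPU utilization. The metric names follow the providers' published names as best I know them, but treat them, the scale factors, and the flat `raw` dictionary shape as assumptions to verify against current documentation. `CpuSample` and `normalize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CpuSample:
    cloud: str
    resource: str
    cpu_percent: float  # always expressed on a 0-100 scale


# cloud -> (provider metric name, factor that converts the value to percent).
# Assumed mapping: one provider reports a 0-1 fraction rather than a percent.
CPU_METRICS: Dict[str, Tuple[str, float]] = {
    "aws": ("CPUUtilization", 1.0),
    "azure": ("Percentage CPU", 1.0),
    "gcp": ("compute.googleapis.com/instance/cpu/utilization", 100.0),
}


def normalize(cloud: str, resource: str, raw: Dict[str, float]) -> CpuSample:
    """Convert one provider-specific reading into the shared schema."""
    metric_name, scale = CPU_METRICS[cloud]
    return CpuSample(cloud, resource, raw[metric_name] * scale)


# Example readings; the resource names and values are made up.
samples = [
    normalize("aws", "vm-1", {"CPUUtilization": 42.0}),
    normalize("gcp", "vm-2", {"compute.googleapis.com/instance/cpu/utilization": 0.5}),
]
for sample in samples:
    print(sample)
```

Once readings share units and field names, they can feed the same dashboards and alert thresholds, although gaps in what each cloud exposes remain.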

Here again, SREs need to work harder and more creatively to manage reliability in a multicloud environment.

In conclusion, multicloud offers lots of advantages, but it also makes the lives of SREs harder. The good news is that SREs can address the multicloud reliability challenges described above. But doing so requires a different approach to reliability than SREs take when working with a single cloud.
