Designing Highly Available Azure Solutions

The uptime service-level agreement (SLA) for cloud-based solutions is sometimes overshadowed by the cloud provider’s availability commitments for its managed services.

However, solution providers share the responsibility for meeting high availability (HA) requirements: they must build resiliency and redundancy into the cloud-based solutions they create on top of those managed services.

In this article, we will look at some of the key considerations when designing highly available Azure solutions. We will not look into the native HA features of the Azure managed services themselves; instead, we will focus on design aspects at the application level.

The Acceptable Availability Level

Highly available solutions come with a higher price tag and increased complexity. Designing a cloud-based solution around an arbitrary availability level that subjectively seems good enough could therefore leave that level either unnecessarily high or, even worse, too low for the given solution context.

Finding an acceptable availability level for the target solution requires due diligence in exploring the solution’s requirements and constraints, such as business needs, the solution’s criticality, regulatory constraints, budget constraints, and organizational and industry availability expectations.

During this exploratory exercise, it is very important to meet with the right stakeholders in the organization to discover the acceptable availability level together, to assess its rationale, and, importantly, to build a shared understanding among the business and technical stakeholders of what that availability level means.
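One way to make an availability level concrete for business and technical stakeholders alike is to translate the percentage into a downtime budget. The following is a minimal Python sketch; the function name and the sample levels are illustrative choices, not figures from any Azure SLA.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime that an availability level permits over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period * (1 - availability_percent / 100)


# Illustrative availability levels only; pick the ones under discussion.
year = timedelta(days=365)
for level in (99.0, 99.9, 99.95, 99.99):
    print(f"{level}% allows about {allowed_downtime(level, year)} of downtime per year")
```

For instance, 99.9% over a 365-day year leaves roughly 8.76 hours of downtime, which is often an easier number to discuss than the percentage itself.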

Solution High Availability

If Azure’s single-region uptime SLAs for the underlying Azure services are considered acceptable for the business needs and constraints, then we should shift our focus to the availability and resiliency of the solution itself, so that the solution does not degrade the uptime that the underlying services maintain.

Generally, attaining quality attributes for cloud-based solutions is a shared responsibility between the cloud provider for the foundation, and the solution providers for the hosted solutions. For example, leveraging a highly scalable Azure App Service plan with a hosted application that is not multi-instance aware will lead to a solution that is unable to scale.

Availability is no different: the application itself should be designed with resiliency against possible internal and external failures. Internal failures are application code and configuration issues, while external failures stem from dependencies on external components, such as IAM issues, integration errors, and network errors.

The solution’s availability can be increased by reinforcing its end-to-end resilience and by introducing redundancy to mitigate the aforementioned failures.

End-to-End Resilience

Cloud-based solutions are highly distributed by nature. Designing a highly available cloud-based solution requires resiliency capabilities to be baked into all of its moving parts. These capabilities should even extend to the components of the consuming clients and parties; after all, these are part of the overall solution and therefore contribute to its resiliency.

A distributed solution’s resiliency is only as strong as its weakest component. In other words, a constituent component with little or no built-in resiliency, on which other parts of the solution depend, will negatively impact the overall solution’s resiliency and hence decrease its availability.
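A rough way to put numbers on this is to multiply the availabilities of the components a request has to pass through. The sketch below assumes that every component is required and that components fail independently, which is a simplification; the component names and figures are hypothetical, not published SLA values.

```python
from math import prod


def serial_availability(component_availabilities):
    """Composite availability when every component must be up.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(component_availabilities)


# Hypothetical components and availabilities, for illustration only.
parts = {"web app": 0.9995, "database": 0.9999, "message queue": 0.999}
composite = serial_availability(parts.values())
print(f"composite: {composite:.4%}, weakest component: {min(parts.values()):.4%}")
```

Under these assumptions the composite figure can never exceed the weakest component and is usually lower, which is why one fragile dependency weighs so heavily on the whole solution.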

The resilience of the solution’s components boils down to application-level code quality and to the techniques used for graceful and timely recovery from internal and external failures.

Graceful recovery is achieved with proper exception handling and with retry patterns that tolerate transient failures. For persistent failures, monitoring for expected and unexpected failures and failing over to a healthy replica are key to a timely recovery, which brings us to the redundancy aspect.
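As an illustration of the retry pattern mentioned above, here is a minimal Python sketch that uses exponential backoff with jitter. `TransientError` and `call_with_retry` are hypothetical names introduced for this example, and the attempt and delay values are arbitrary; in practice, many client SDKs ship their own retry policies, which are usually preferable to hand-rolled ones.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Run operation, retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Treated as a persistent failure: surface it so that
                # monitoring and failover can take over.
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads out retries from many clients.
            sleep(random.uniform(0, delay))
```

Only errors classified as transient are retried here; anything else propagates immediately, so that persistent failures are not masked by repeated attempts.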

Solution Redundancy

The solution’s availability is also increased with redundancy, that is, by provisioning one or more passive replicas, each containing all of the solution’s components as separate Azure instances.

This is especially important for the solution’s availability in the case of persistent failures: failing over to a stable, passive replica allows us to recover from this type of issue within the Recovery Time Objective (RTO).
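To get a feel for how much redundancy can help, the following sketch computes an idealized availability for a set of replicas. It assumes that replicas fail independently and that failover is instantaneous and always succeeds, neither of which holds in practice, so the result is best read as an upper bound. The function name and the sample figure are illustrative.

```python
def redundant_availability(replica_availability: float, replicas: int) -> float:
    """Idealized availability of N independent replicas where any one suffices.

    Ignores failover time and failover failures, so it overstates what a
    real primary/passive setup achieves.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - replica_availability) ** replicas


# Hypothetical per-replica availability of 99%.
for n in (1, 2, 3):
    print(f"{n} replica(s): {redundant_availability(0.99, n):.4%}")
```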

Again, we assume here that the Azure uptime SLAs in a single region are considered acceptable against the agreed-upon availability level.

One piece of advice here is to adopt a consistent redundancy strategy. Imagine the challenge of trying to manage the different redundancy features across Azure services and to coordinate these disparate features during the failover process. A consistent redundancy strategy lowers the solution’s operational complexity during the failover and failback processes, which in turn contributes to increasing the solution’s availability.

Redundancy Technical Considerations

Here I will touch on some of the key technical aspects associated with the design and operation of redundant instances:

  • Routing: There are many mechanisms in Azure that can route upstream HTTP traffic between the solution replicas. They range from lightweight options, such as managing DNS records in Azure DNS or using Azure Functions Proxies, all the way to feature-rich routing services such as Azure Traffic Manager and Azure Front Door.
  • Monitoring: Azure Monitor should be leveraged to monitor solution-specific logs and/or metrics, some of which could be used to trigger the failover process.
  • Codebase & Automation: Well-thought-out branching and versioning strategies should be devised to maintain the application code for the primary and secondary replica(s). In addition, a CI/CD plan will be required to automate the testing and release of changes to these different replicas.
  • Drills: Conducting failover and failback drills is crucial to validate the solution’s RTO, the stability of the replicas, and the reliability of the routing mechanism.
  • Data: Stateful solutions add the complexity of handling two-way data synchronization between the different data store instances across the replicas, to ensure data consistency during failover and failback.
  • Security Posture: Hardening the solution’s security posture can mitigate malicious cyberattacks, such as DDoS or ransomware, which could impact the solution’s availability.
  • Zero Downtime: Zero downtime is very hard, if not impossible, to achieve. Instantaneous failover is unrealistic; instead, the target should be to minimize the RTO in line with the actual requirements. There will still be some downtime, however small, during the failures preceding the failover and during the failover process itself.
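To tie the routing and monitoring considerations together, here is a minimal sketch of the decision logic that could sit behind a failover trigger: traffic moves to the secondary after a number of consecutive failed health probes. `FailoverMonitor`, its fields, and the threshold of three are all hypothetical; this is not how any particular Azure service implements probing, and the actual routing change is left as a comment.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks health-probe results and decides when to switch to the secondary."""
    failure_threshold: int = 3      # consecutive failed probes before failing over
    active: str = "primary"
    consecutive_failures: int = 0

    def record_probe(self, healthy: bool) -> str:
        """Record one probe result for the active replica; return the active replica."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if (self.active == "primary"
                    and self.consecutive_failures >= self.failure_threshold):
                # In a real setup, the routing layer would be updated here.
                self.active = "secondary"
                self.consecutive_failures = 0
        # Failback is deliberately not automated in this sketch.
        return self.active
```

Requiring several consecutive failures avoids failing over on a single transient blip, at the cost of a slightly longer detection time, which counts against the RTO.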

Performance Can Also Impact Availability

Finally, performance issues, which usually rear their heads in production environments, can introduce availability problems for the solution’s internal and external dependent components.

For example, a downstream service instance that runs on an entry-level Azure SKU, or one that lacks performance-related code optimization, will be unable to scale properly to meet the latency and/or throughput requirements of the upstream dependent components.

These performance issues usually manifest as timeout failures and temporarily limited access to the downstream systems, resulting in a decrease in the solution’s availability.
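On the calling side, one common mitigation is to bound how long an upstream component waits for a slow downstream call and to degrade gracefully when the bound is exceeded. The sketch below uses Python's standard `concurrent.futures` module; `call_with_timeout` and the fallback value are hypothetical names for illustration, and the approach only limits the caller's wait, since the slow call itself keeps running in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(operation, timeout_seconds, fallback):
    """Run operation in a worker thread; return fallback if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The downstream call was too slow: degrade gracefully instead of blocking.
        return fallback
    finally:
        # Do not block on the slow call; the worker thread finishes on its own.
        pool.shutdown(wait=False)


def slow_downstream():
    time.sleep(0.3)  # stands in for a slow or overloaded downstream service
    return "fresh data"


print(call_with_timeout(slow_downstream, 0.05, "cached data"))
```

A timeout like this does not fix the underlying performance issue; it only keeps a slow dependency from turning into an outage for the components that depend on it.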