Implementing Business Continuity on Azure

There is a general misconception among cloud consumers that the availability of their resources in the cloud is always guaranteed. This is not true since all cloud providers, including Microsoft, offer specific SLAs for their products that almost never reach an availability target of 100%. For the consumers who have deployed critical resources and applications to the cloud, reaching the company-defined targets for Business Continuity can be technically challenging and confusing. The purpose of this blog post is to provide practical guidance on how Business Continuity is expressed on the cloud, how it can be implemented for many Azure IaaS and PaaS services and what real-world problems each solution attempts to solve.

Introduction

Before we dive into the technical Azure-specific details, let’s explain what Business Continuity is and what it involves.

Business Continuity is the capability of the organization to continue the delivery of products or services at acceptable predefined levels following a disruptive incident. According to ISO 22301, business continuity is not limited only to IT and it involves many enterprise aspects.

In this blog post, we will focus on the business continuity aspects related to IT. Each of them corresponds to a specific type of SLA that you may have internally or with your customers, so there are multiple aspects of Business Continuity that may be applicable to you.

If it’s important for you to keep your services always up and running, you should focus on High Availability. This is the ability of a system to be continuously operational, or, in other words, have an uptime percentage of near 100%. It is generally achieved by implementing redundant, mirrored copies of the hardware and data, so that if one component fails, another one takes over.

If fluctuating demand and bottlenecks cause your systems to struggle, then you may need to focus on Scalability. This is the ability of a system to scale up or scale down cloud resources as needed to meet fluctuating demand. It can be considered as an aspect of Business Continuity, since peaks in demand can be the result or the cause of an incident.

Finally, to protect data that are critical to your company’s functionality and need to be always available and recoverable, you should implement Backup. This is the duplication of data to a secondary location, so that if the primary copy is harmed or becomes unavailable, data from the other location can be retrieved and the system can be rolled back to a specific point in time.

The following diagram shows an analogy between the aforementioned terms and the problems they tackle.

*Business Continuity: Mapping of problems and solutions*

It is important to note that the implementation of any of the controls described in this blogpost should be based on a structured business continuity assessment/plan, and should be selected based on the requirements of your environment or application. Improvident implementation of controls could result in undue costs or in inefficient protection.

Implementing High Availability

Depending on the required uptime of your application or system and the scale of disaster you need to be able to recover from, there are many ways to implement high availability in Azure. When choosing the controls that will be implemented in your environment, you should always consider that the availability in a chain of resources is determined by the weakest link in the chain. For example, in the case of an application composed by a front-end server and a database, if the web server is spread across multiple availability zones but the database is single-instance, the whole application will not be available anymore if the availability zone of the database goes down. Having the above in mind, we present below the different options provided by Azure, sorted by increasing complexity and costs.

Protection against hardware failures

Small-scale technical or hardware issues may affect single-instance components. To avoid this, the component should be mirrored to a secondary hardware volume. On Azure, depending on your cloud computing state, this can be implemented as follows:

IaaS

When the component is a Virtual Machine (VM), this can be achieved by using availability sets. An availability set is a logical grouping of VMs that allows Azure to understand how your application is built to provide for redundancy and availability. While for single-instance VMs Azure guarantees a 99,9% uptime SLA, by using availability sets the uptime is increased to 99,95%. To use availability sets on Azure VMs, you need to perform the following steps:

Create an availability set;
Create new VMs; in the creation wizard, under “Availability options” choose “Availability set” and then select the previously created set.

Note: It is not possible to add existing VMs to an availability set after their creation.

PaaS

Azure PaaS components are protected against local hardware failures by design, guaranteeing higher uptime SLAs than IaaS. Specifically:

Storage Accounts: Microsoft ensures 3 instances of the service when using the default redundancy option (Locally redundant Storage – LRS). This offers an SLA of 99.999999999% (11 nines) uptime.
SQL Databases: By default, Microsoft ensures at least two instances of the service within the same data center, reaching 99,99% uptime.
Cosmos DB: By default, Microsoft provides three replicas (individual nodes) within a cluster, ensuring an SLA of 99,99% uptime.
App Service: Microsoft guarantees an SLA of 99.95% uptime for App Services, for tiers other than Free or Shared.

Protection against datacenter failures

To provide the option of protecting against failures that affect the whole datacenter, such as fire, power and cooling disruptions or flood, Microsoft has introduced the concept of availability zones. Availability zones are unique physical locations within an Azure region, each made up of one or more datacenters with independent power, cooling, and networking. The creation of multiple instances of services across two or more zones provides increased high availability, as it protects both against hardware and against datacenter failures.

Azure Availability Zones
Source: What are Azure regions and availability zones? | Microsoft Learn

Based on your Cloud computing model, such protection can be achieved as follows:

IaaS

Virtual machines can be deployed across multiple availability zones to provide an uptime SLA of 99,99%. This can be done with the following steps:

Create a VM; under “Availability options” select “Availability zone” and specify a zone. This will be the primary zone of your VM.
Open the VM.
Under Operations, select “Disaster recovery” and set the option “Disaster Recovery between Availability Zones?” to “Yes”. Under “Advanced settings” you will be able to see or change the secondary zone.
Click on “Review and start replication”.

PaaS

When it comes to PaaS services, it is generally easier to deploy them across multiple availability zones. Specifically:

Storage Accounts:
- Microsoft ensures 3 instances of the service across three different availability zones (Zone redundant Storage – ΖRS). This offers an SLA of 99.9999999999% (12 nines) uptime over a given year.
- The option can be enabled during the Storage Account creation, under Basics – Redundancy.
SQL Databases:
- Zone redundancy available at the General purpose, Hyperscale, Business Critical and Premium Service tiers. The SLA depends on the tier and can reach 99.995% of uptime.
- The option can be configured during the SQL DB creation, under the Service tier selection menu, or under Settings – Compute + storage for existing databases.
Azure Cosmos DB Accounts:
- Enabling zone redundancy in an Azure Cosmos DB account can increase the uptime SLA to 99.995%.
- The option can be selected during the Cosmos DB Account creation, under Global Distribution – Availability Zones.
App Service:
- Zone redundancy is only available in either Premium v2 or Premium v3 App Service Plans for Web Apps and in Elastic P2 for Function Apps. At the time of writing, Microsoft has not published specific SLAs for zone-redundant App Services, but it guarantees at least three instances of the service.
- The option can be enabled during the creation of the Service Plan, under Zone redundancy.

Note: It is not possible to enable availability zone support after the creation of any of the above components.

Protection against regional failures

Finally, to protect against regional failures that can affect many adjacent datacenters and can be caused by large-scale natural and man-made disasters (e.g., earthquake, tornados, war), Microsoft has introduced the concept of availability regions. Azure regions are physical regions all over the world, designed to offer protection against local disasters within availability zones and against regional or large geography disasters by making use of another region and replicating the workloads to that region. The secondary region could be considered as the disaster recovery site. Availability regions can be used independently of availability zones or in conjunction with them.

*Azure Availability Regions*
*Source:* *What are Azure regions and availability zones? | Microsoft Learn*

Based on the Cloud computing model of the application components that you want to protect, you have the following options:

IaaS

Virtual machines can be deployed across multiple availability regions to provide an uptime SLA of 99,99%. This capability is offered by the Azure Site Recovery service that orchestrates the replication, failover, and recovery of the VMs. Site Recovery can be implemented with the following steps:

Open the VM for which you want to configure regional redundancy.
Under Operations, select “Disaster recovery”, set the option “Disaster Recovery between Availability Zones?” to “No” and select a target region.
Click on “Review and start replication”.

PaaS

PaaS services can be protected from regional disasters as follows:

Storage Accounts:
- With the Geo redundant Storage option (GRS), Microsoft ensures that there are 3 instances of the Storage Account in the primary region and another 3 instances in the secondary region. This offers an SLA of 99.99999999999999% (16 nines) uptime over a given year. There is also the option of Geo-zone-Redundant Storage (GZRS) which spreads the instances of the primary region across 3 different availability zones, to enable fastest recovery times in case of a datacenter failure.
- The option can be enabled under Data Management – Redundancy for an existing Storage Account.
SQL Databases:
- An additional replica for read operations can be created in a secondary region and used as a disaster recovery site. The SLAs for the geo-redundant setup vary depending on the selected service tier.
- The option can be configured for a given SQL DB, under the Service tier selection menu.
Azure Cosmos DB Accounts:
- Enabling geo redundancy in an Azure Cosmos DB account can increase the SLA for read operations to 99.999%.
- The option can be selected during the Cosmos DB Account creation, under Global Distribution – Geo-Redundancy, or for an existing Cosmos DB account under Settings – Replicate data globally.
App Service:
- As of the time of writing, there is no geo-redundancy support for Azure App Service.

Before closing with High Availability, remember that a highly available system is a system that your customers and employees can rely on. It increases the credibility of your company, improves its reputation and offers peace of mind to your valuable users. Although costs may go up, depending on your implementation choices, it shall assist you to establish yourself as a trustworthy partner.

Implementing scalability

For systems whose load can abruptly increase or decrease, a problem arises: How can you guarantee the available level of resources during high periods, while at the same time keeping your costs to the minimum during the low periods? This is the essence of scaling, and in the cloud, achieving this balance is much easier than in traditional, on-premises infrastructures. There are two main ways that an application can scale: vertical scaling and horizontal scaling. Vertical scaling (scaling up) increases the capacity of a resource, for example, by increasing the VM size, CPU, memory, etc. Horizontal scaling (scaling out) adds new instances of a resource, such as VMs or database replicas.

While vertical scaling can be achieved more easily, and without making any changes to the application, at some point it hits a limit where the system cannot be scaled more. On the other hand, horizontal scaling is more flexible, cheaper, and applies to big, distributed workloads. It also enables autoscaling, which is the process of dynamically allocating resources to ensure performance. That is why, especially in the Cloud, horizontal scaling is the recommended option.

The options Azure provides you with are as follows:

IaaS

Scalability of Virtual Machines in Azure can be achieved through Virtual Machine Scale Sets (VMSS). These represent groups of load-balanced VMs that provide scalability to applications by automatically increasing or decreasing the number of VM instances in response to demand or a defined schedule. VMSS can be deployed into one Availability Zone, multiple Availability Zones or even regionally.

During the creation of a VMSS resource, the cloud consumer can specify the scalability options and minimum instance number, the Load Balancer (or Application Gateway, in case of HTTPS traffic) that will be used, and other networking and orchestration options.

PaaS

In most cases, managed PaaS services have horizontal scaling and autoscaling built in. The ease of scaling these services is a major advantage of using Azure PaaS services.

Specifically for App service, we should point out that the scaling options depend on the App Service plan (tier) and can reach a maximum of 100 instances when using the Isolated tier.

Implementing backups

Although HA solves the problem of small or extended failures, what happens if the unavailability of data originates from a malicious threat, such as a ransomware attack? In this case, having a highly available infrastructure will simply replicate the encrypted/corrupted files everywhere almost immediately, leaving no recovery options. Here is where the value of remote data copies, that are unaffected by real-time modifications, lies. With Azure Backup, regular backups and snapshots of workloads are taken, so that in case of unauthorized modification or deletion, the service can be restored to a specific point in time.

Backups in Azure can be implemented both for IaaS and for PaaS services, and the options are presented below.

IaaS

VM backups can be either locally redundant or zone redundant. Recovery from backups can be implemented in two ways: the standard option generates backups once a day and maintains instant restore snapshots for 2 days. The enhanced option generates multiple backups per day, maintains instant restore snapshots for 7 days and ensures that snapshots are spread across zones for increased resiliency. The second option applies to VMs of high criticality.

In an existing Azure VM, Backups can be configured under Operations – Backup.

PaaS

Multiple backup options also exist for different PaaS services:

Storage Accounts:
- Azure provides the option to configure operational backups of the blobs of a Storage Account. This is a local backup solution that maintains data for a specified duration in the source storage account itself. Although a Backup Vault is required to manage the backup, there is no copy of the data stored in the Vault. The backup is continuous and allows the reversion to a specific point in time in case of data corruption.
- The option can be configured under Data management – Data protection for existing Storage Accounts, by selecting the checkbox “Enable operational backup with Azure Backup” and following the necessary steps presented in the portal.
SQL Databases:
- Azure gives the option of locally redundant, zone-redundant or geo-redundant backups.
- Configuration can be performed with the option “Backup storage redundancy” under Settings – Compute + storage for existing databases, or under Basics – “Backup storage redundancy” during the database creation.
Azure Cosmos DB Accounts:
- Two options are offered for backup functionality, either periodic (LRS or ZRS or GRS) or continuous backup. When using the periodic backup mode, which is the default one for all accounts, backups are taken at periodic intervals and the data can only be restored by creating a request with Microsoft’s support team. On the contrary, continuous backup facilitates the restoration to any point of time within either 7 or 30 days (depending on the tier) through the portal.
- Can be configured under “Backup Policy” during creation, or under the “Backup & Restore” pane for existing accounts.
App Service:
- Azure provides the possibility to backup an App’s content, configuration and database by enabling and scheduling periodic backups, which are stored in a Storage Account.
- Backup and restoration options can be configured under Settings – Backups for existing App Services.

Overall, it is important to find the balance between the frequency of backups and the amount of maintained past snapshots, in order to lose as less data as possible in case of an incident, be able to revert to a healthy past state and at the same time maintain costs to an acceptable level for your company.

Conclusion

To conclude, it all ends up to one question: Can you survive? Can you recover from disasters as small as power interruptions to as big as pandemics and earthquakes? Business Continuity is the key to the answer. And in the modern, distributed Cloud world, all the capabilities are there – it’s just up to you, your dedication and commitment to implement the ones that are essential to your business.

Elpida Rouka

Elpida is an Information Security Consultant, with expertise in Azure/O365 Security, SIEM, Identity & Access management, Risk management, Information Security Management Systems (ISMS) and Business Continuity planning (ISO22301). She is always eager to create innovative high-quality solutions that precisely meet business needs.

Elpida Rouka | LinkedIn

Stijn Wellens

Stijn is a manager with experience in cloud and network security. He is Solution Lead for Cloud Security Assessments and Microsoft Cloud Security Engineering at NVISO. Besides the technical challenges during Azure and Microsoft 365 security roadmap implementations, Stijn enjoys coaching the teams by sharing his knowledge and experience.

Stijn Wellens | LinkedIn

Introduction

Implementing High Availability

Protection against hardware failures

IaaS

PaaS

Protection against datacenter failures

IaaS

PaaS

Protection against regional failures

IaaS

PaaS

Implementing scalability

IaaS

PaaS

Implementing backups

IaaS

PaaS

Conclusion

One thought on “Implementing Business Continuity on Azure”

Discover more from NVISO Labs