The total cost of unplanned outages has been rising sharply year over year. A 2016 study conducted by the Ponemon Institute found that the mean total cost per minute of an unplanned outage was $8,851, a 32% increase since 2013 and an 81% increase since 2010. A 2022 study by EMA Research puts that number at $12,900. These figures show how crucial it is for organizations to have a solid, well-thought-out disaster recovery strategy in place to minimize downtime and data loss once disaster strikes.
Ensuring business continuity and safeguarding mission-critical systems against unexpected failures can be time-consuming, expensive, and difficult to maintain, especially as systems scale. It is also not uncommon for disaster recovery (DR) solutions to cost enterprises anywhere from several hundred thousand to millions of dollars per year, placing significant strain on IT budgets.
However, setting up and maintaining DR infrastructure doesn’t have to be cumbersome or costly. This is where leveraging infrastructure as code (IaC) within your DR plan comes into play.
This blog post showcases how HashiCorp Terraform can be used to effectively set up, test, and validate your DR environments in a cost-efficient, practical, and consistent manner by codifying the infrastructure provisioning process.
» Disaster recovery strategies and terminologies
Before diving into how Terraform can help provision and manage DR-related infrastructure, it’s important to understand the concepts of Recovery Time Objective (RTO) and Recovery Point Objective (RPO), including how they differ from each other and how they apply to your organization’s particular DR strategy:
- Recovery Time Objective (RTO): The maximum amount of time a business can take to restore operations after an unplanned outage before the organization’s mission is negatively impacted.
- Recovery Point Objective (RPO): The maximum amount of data a business can afford to lose, measured in time. This typically varies from a few minutes to several hours depending on business requirements.
It’s also necessary to understand some of the most popular disaster recovery strategies. The list below is ordered from least to most expensive (Figure 1 below places each method on a spectrum of complexity and RTO/RPO).
Keep in mind that you will typically see a combination of these methods being used simultaneously within an organization’s DR strategy. For example, a container/VM cluster orchestrator will typically leverage the Pilot Light methodology, while database infrastructure might use the Backup & Data Recovery method:
- Backup & Data Recovery: The least complex and least costly DR strategy covered here. This method involves backing up your systems/data to a different location and, in case of disaster, restoring the data from backup onto either an existing or new system. This can be a simple and cost-effective strategy. However, depending on the amount of data and the recovery process, it can lead to high RTOs and/or RPOs.
- Pilot Light: The goal of a Pilot Light environment is to have a minimalistic copy of your production environment with only the key components/services running in another location. When disaster occurs, the additional required components are provisioned and scaled up to production capacity. This strategy is typically quicker than the Backup/Data Recovery option, but it brings more complexity and cost as well.
- Active/Passive: In this strategy, a fully functional replica of the production environment is created in a secondary location. This is the most expensive and complex strategy of the options discussed so far, but it also provides the quickest recovery time and minimal data loss compared to the previous methods.
- Multi-Region Active/Active: This is where systems/applications are distributed across multiple geographic regions. If one region fails, traffic is automatically redirected to other healthy regions. This is the most complex and expensive of all the strategies, but it also provides the highest level of resilience and availability, protecting mission-critical applications even against full-region outages.

Figure 1 – Disaster recovery strategies sorted by complexity and RTO/RPO
» Why use Terraform with your DR strategy?
If you have gone through the process of selecting and using DR tooling in the past, you most likely encountered one, or more, of the following problems:
- Cost: As I previously mentioned, disaster recovery tools can be extremely expensive. Licensing fees coupled with ongoing costs of maintaining redundant, idle infrastructure can be a significant strain on IT budgets.
- Lack of flexibility: DR toolsets are typically tied to a particular platform. This results in additional complexity and reduced flexibility when it comes to setting DR strategies across multiple cloud providers. This also applies to leveraging a managed solution from one of the major public clouds. While leveraging a cloud-specific DR solution may be convenient at first, it will limit your options for multi-cloud and hybrid strategies in the future as you expand.
- Performance: These tools can also be slow in terms of recovery speed. Legacy DR solutions typically rely on complex mechanisms that are slow and error-prone, making desired RTOs and RPOs difficult to achieve.
Terraform not only helps solve all of these issues, but also provides several other key advantages when leveraged within your disaster recovery strategy:
- Automation: Terraform allows you to automate the entire infrastructure deployment and recovery process, minimizing the need for manual intervention and greatly reducing risk of human error. This also ensures consistency and repeatability within your DR infrastructure setup.
- Repeatability: With Terraform, you are adopting an infrastructure as code mindset, meaning that you ensure consistent infrastructure configuration across multiple environments by defining your infrastructure once in a codified manner. This mitigates configuration drift and ensures that your DR environment accurately mirrors your production setup.
- Scalability: Terraform enables you to scale your environments as needed with ease, allowing you to test your DR infrastructure plans at scale, ensuring they can handle real-world scenarios.
- Cost efficiency: Terraform allows you to dynamically provision and destroy ephemeral resources as needed, resulting in greatly reduced infrastructure costs as you only pay for the resources utilized during your DR exercise instead of incurring ongoing costs from resources that remain idle most of the time.
- Flexibility: With Terraform being a cloud agnostic solution, you have the ability to not only spin up infrastructure in different availability zones or regions within a single cloud provider, but you can provision and manage resources across multiple cloud providers as well.
» How to use Terraform with your DR strategy
Let’s revisit the DR strategies mentioned previously and look at examples of how Terraform can be used with each one:
- Backup & Data Recovery: The -refresh-only flag (used with terraform plan or terraform apply) updates the Terraform state file to match the actual infrastructure state without modifying the infrastructure itself. This can be run after a backup or recovery operation in order to sync Terraform state and reduce drift.
- Pilot Light and Active/Passive: Terraform conditional expressions can be leveraged to deploy only the required infrastructure components needed for a Pilot Light while keeping other resources in a dormant state, or label an Active/Passive configuration as on/off until a DR event occurs. Once a DR event occurs, conditionals can trigger resource scaling to full production capacity, ensuring minimal downtime and operational impact. The next section of this post shows an example of this Active/Passive cutover.
- Multi-Region Active/Active: Terraform modules can be used to encapsulate and reuse infrastructure components. This plays a crucial role in maintaining consistency in large-scale, multi-region environments while simplifying infrastructure management through a single source of truth for your infrastructure code. As an example, you can parameterize your modules by region, ensuring you deploy the same infrastructure across multiple regions:
# Terraform modules parameterized by region
module "vpc" {
  source = "./modules/vpc"
  region = var.region
}

module "compute" {
  source         = "./modules/compute"
  region         = var.region
  instance_count = var.instance_count
}
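Building on the snippet above, a multi-region deployment could instantiate the same module once per region. The provider aliases and region names below are assumptions for illustration and would need to match your own provider configuration:

```hcl
# Hypothetical sketch: the same VPC module deployed to two regions via
# provider aliases (assumed to be declared in the root module)
module "vpc_us_east_1" {
  source = "./modules/vpc"
  region = "us-east-1"
  providers = {
    aws = aws.use1
  }
}

module "vpc_us_west_2" {
  source = "./modules/vpc"
  region = "us-west-2"
  providers = {
    aws = aws.usw2
  }
}
```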
It is also worth noting that the terraform import command can be a valuable tool within your DR strategy, ensuring existing infrastructure created outside of Terraform is brought under management.
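As a sketch, the state-sync and import steps might look like this from the CLI; the resource address and instance ID below are placeholders:

```shell
# Refresh Terraform state to match reality after an out-of-band restore,
# without changing any infrastructure
terraform apply -refresh-only

# Bring a server restored outside of Terraform under management
# (placeholder resource address and instance ID)
terraform import aws_instance.prod_webserver i-0123456789abcdef0
```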
» Disaster Recovery Active/Passive cutover example
To demonstrate how you can leverage Terraform for your DR strategy, the example below shows how to conduct a complete region failover within AWS for a web server hosted on an Amazon EC2 instance behind Route 53 (Refer to Figure 2 below).
The complete code repository for this example can be found here.
Note: I will be using my own domain already set up as an AWS Route 53 Hosted Zone (andrecfaria.com). If you are following along, this value should be replaced with whatever domain you set up within your Terraform configuration.
In a real-world scenario, your environment typically will be much more robust, most likely including:
- Multiple web servers across several availability zones
- Load balancers sitting in front of the web servers
- Databases in both regions with cross-region replication in place
- And more
However, for simplicity, this example uses only a single EC2 instance.

Figure 2 – Web server hosted on an Amazon EC2 instance behind Route 53
This scenario employs the Active/Passive DR strategy with all of your infrastructure provisioned and managed through Terraform. However, the infrastructure required for a DR failover is provisioned only when you trigger the failover itself, preventing ongoing costs from idle compute instances and other cloud resources. After running terraform apply, you see the following outputs:
Outputs:
current_active_environment = "Production"
dns_record = "test.andrecfaria.com"
production_public_ip = "18.234.86.230"
You can use the dig command to verify that your DNS record points to the production IP address:
$ dig test.andrecfaria.com
; <<>> DiG 9.18.28-0ubuntu0.22.04.1-Ubuntu <<>> test.andrecfaria.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<-
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;test.andrecfaria.com. IN A
;; ANSWER SECTION:
test.andrecfaria.com. 60 IN A 18.234.86.230
;; Query time: 9 msec
;; SERVER: 10.255.255.254#53(10.255.255.254) (UDP)
;; WHEN: Mon Feb 10 16:04:47 EST 2025
;; MSG SIZE rcvd: 65
You can also run a curl command to view the contents of your production webpage:
$ curl "http://test.andrecfaria.com"
<h1>Hello World from Production!</h1>
Looking at the Terraform code, within the variables.tf file you can find the following dr_switchover variable:
variable "dr_switchover" {
  type        = bool
  description = "Flag to control environment switchover (false = Production | true = Disaster Recovery)"
  default     = false
}
This variable is a key component of the DR configuration because it defines whether the Route 53 DNS record points to the production web server (by keeping the default value of false), or whether the record should switch over to the DR web server, creating the required infrastructure resources for the DR failover, by setting its value to true.
This is accomplished by leveraging Terraform’s conditional expressions when setting the records argument within the aws_route53_record resource declaration, as well as the count argument within the DR resources.
# Route53 Record - Conditional based on dr_switchover
resource "aws_route53_record" "test" {
  zone_id = data.aws_route53_zone.selected.zone_id
  name    = "${var.subdomain}.${var.domain_name}"
  type    = "A"
  ttl     = 60
  records = [var.dr_switchover ? aws_instance.dr_webserver[0].public_ip : aws_instance.prod_webserver.public_ip]
}
# Disaster Recovery EC2 Instance
resource "aws_instance" "dr_webserver" {
  count                  = var.dr_switchover ? 1 : 0
  provider               = aws.dr
  ami                    = var.dr_ami_id
  instance_type          = var.instance_type
  key_name               = var.key_name
  vpc_security_group_ids = [aws_security_group.dr_sg.id]

  user_data = <<-EOF
    #!/bin/bash
    sudo yum update -y
    sudo yum install -y nginx
    sudo systemctl start nginx
    sudo systemctl enable nginx
    echo "<h1>Hello World from Disaster Recovery!</h1>" | sudo tee /usr/share/nginx/html/index.html
  EOF

  tags = {
    Name        = "dr-instance"
    Environment = "Disaster Recovery"
  }

  depends_on = [aws_security_group.dr_sg]
}
The only change required to cut over to the DR environment is setting the value of the dr_switchover variable to true:
$ terraform apply -var="dr_switchover=true" -auto-approve
Below are the actions and output that Terraform will display when creating the DR EC2 instance and performing an in-place update to the Route 53 record resource, changing the records argument to point to your DR web server IP address instead of the production IP address:
Terraform will perform the following actions:

  # aws_instance.dr_webserver[0] will be created
  + resource "aws_instance" "dr_webserver" {
      ...
    }

  # aws_route53_record.test will be updated in-place
  ~ resource "aws_route53_record" "test" {
        id      = "Z0441403334ANN7OFVRF1_test.andrecfaria.com_A"
        name    = "test.andrecfaria.com"
      ~ records = [
          - "18.234.86.230",
        ] -> (known after apply)
        # (7 unchanged attributes hidden)
    }

Plan: 1 to add, 1 to change, 0 to destroy.
Changes to Outputs:
~ current_active_environment = "Production" -> "Disaster Recovery"
+ dr_public_ip = (known after apply)
Outputs:
current_active_environment = "Disaster Recovery"
dns_record = "test.andrecfaria.com"
dr_public_ip = "54.219.217.97"
production_public_ip = "18.234.86.230"
Once the Terraform run is complete, you can validate that the DNS record now points to the DR web server by using the same dig and curl commands as before:
# dig command results showing the DR IP address
$ dig test.andrecfaria.com
; <<>> DiG 9.18.28-0ubuntu0.22.04.1-Ubuntu <<>> test.andrecfaria.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<-
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;test.andrecfaria.com. IN A
;; ANSWER SECTION:
test.andrecfaria.com. 60 IN A 54.219.217.97
;; Query time: 19 msec
;; SERVER: 10.255.255.254#53(10.255.255.254) (UDP)
;; WHEN: Mon Feb 10 16:16:25 EST 2025
;; MSG SIZE rcvd: 65
# curl command showcasing DR webpage contents
$ curl "http://test.andrecfaria.com"
<h1>Hello World from Disaster Recovery!</h1>
Finally, we can fail back to production by simply running the terraform apply command again, this time setting the dr_switchover variable back to false. This also destroys all the infrastructure created when failing over to DR, preventing unnecessary spend on idle resources.
# Setting the dr_switchover variable value via the CLI
$ terraform apply -var="dr_switchover=false" -auto-approve
# Terraform apply run output
Terraform will perform the following actions:

  # aws_instance.dr_webserver[0] will be destroyed
  # (because index [0] is out of range for count)
  - resource "aws_instance" "dr_webserver" {
      ...
    }

  # aws_route53_record.test will be updated in-place
  ~ resource "aws_route53_record" "test" {
        id      = "Z0441403334ANN7OFVRF1_test.andrecfaria.com_A"
        name    = "test.andrecfaria.com"
      ~ records = [
          - "54.219.217.97",
          + "18.234.86.230",
        ]
        # (7 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 1 to destroy.
Changes to Outputs:
~ current_active_environment = "Disaster Recovery" -> "Production"
- dr_public_ip = "54.219.217.97" -> null
» Cleanup
If you have been following along by deploying your own resources, don’t forget to run the terraform destroy command to clean up your environment and avoid incurring unwanted costs.
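For reference, the teardown is a single command; the -auto-approve flag (used the same way as in the earlier apply runs) skips the confirmation prompt:

```shell
# Destroy all resources managed by this configuration
terraform destroy -auto-approve
```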
» Other considerations
Some additional considerations to be mindful of when using Terraform for DR infrastructure provisioning include, but are not limited to:
- Application install time: Applications not managed by Terraform can take additional time to install and configure when performing a DR failover. Ensure that this is accounted for when determining RTO.
- DNS propagation time: Keep in mind that DNS changes can take time to propagate. For a planned failover, this can be mitigated by proactively lowering the time-to-live (TTL) values of your DNS records a few days in advance.
- Backups: Terraform does not backup your data and is not a replacement for your backup systems. Ensure that you have a solid backup strategy in place that meets your requirements in addition to your DR strategy.
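For a planned failover, that TTL reduction can itself be managed through Terraform. A minimal sketch, assuming the record’s ttl argument is wired to this variable (the variable name and values are illustrative):

```hcl
# Illustrative: drive the Route 53 record TTL from a variable so it can be
# lowered ahead of a planned failover and raised again afterward
variable "record_ttl" {
  type        = number
  description = "TTL for the DNS record, in seconds"
  default     = 3600 # lower to e.g. 60 a few days before a planned cutover
}
```

The aws_route53_record resource would then set ttl = var.record_ttl, so the change rolls out with a normal terraform apply.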
» Conclusion
This blog post demonstrated how Terraform can be leveraged to automate, simplify, and reduce costs related to provisioning and managing infrastructure within your disaster recovery strategy. To learn more about Terraform, visit the HashiCorp developer portal, where you can find more information regarding best practices, integrations, and reference architectures.