The total cost of unplanned outages has been rising sharply year over year. A 2016 study conducted by the Ponemon Institute found that the mean total cost per minute of an unplanned outage was $8,851, a 32% increase since 2013 and an 81% increase since 2010. A 2022 study by EMA Research puts that number at $12,900. These figures show how crucial it is for organizations to have a solid, well-thought-out disaster recovery strategy in place to minimize downtime and data loss once disaster strikes.
Ensuring business continuity and safeguarding mission-critical systems against unexpected failures can be time-consuming, expensive, and difficult to maintain, especially as systems scale. It is also not uncommon for disaster recovery (DR) solutions to cost enterprises anywhere from several hundred thousand to millions of dollars per year, placing significant strain on IT budgets.
However, setting up and maintaining DR infrastructure doesn’t have to be cumbersome or costly. This is where leveraging infrastructure as code (IaC) within your DR plan comes into play.
This blog post showcases how HashiCorp Terraform can be used to effectively set up, test, and validate your DR environments in a cost-efficient, practical, and consistent manner by codifying the infrastructure provisioning process.
» Disaster recovery strategies and terminologies
Before diving into how Terraform can help provision and manage DR-related infrastructure, it’s important to understand the concepts of Recovery Time Objective (RTO) and Recovery Point Objective (RPO), including how they differ from each other and how they apply to your organization’s particular DR strategy:
- Recovery Time Objective (RTO): The maximum amount of time a business can take to restore operations after an unplanned outage before the organization’s mission is negatively impacted.
- Recovery Point Objective (RPO): The maximum amount of data a business can afford to lose, measured in time. This typically varies from a few minutes to several hours depending on business requirements.
It’s also necessary to understand some of the most popular disaster recovery strategies. The list below is ordered from least to most expensive (Figure 1 below places each method on a spectrum of complexity and RTO/RPO).
Keep in mind that you will typically see a combination of these methods being used simultaneously within an organization’s DR strategy. For example, a container/VM cluster orchestrator will typically leverage the Pilot Light methodology, while database infrastructure might use the Backup & Data Recovery method:
- Backup & Data Recovery: The least complex and least costly DR strategy covered here. This method involves backing up your systems/data to a different location and, in case of disaster, restoring the data from backup onto either an existing or new system. This can be a simple and cost-effective strategy. However, depending on the amount of data and the recovery process, it can lead to high RTOs and/or RPOs.
- Pilot Light: The goal of a Pilot Light environment is to have a minimalistic copy of your production environment with only the key components/services running in another location. When disaster occurs, the additional required components are provisioned and scaled up to production capacity. This strategy is typically quicker than the Backup/Data Recovery option, but it brings more complexity and cost as well.
- Active/Passive: In this strategy, a fully functional replica of the production environment is created in a secondary location. This is the most expensive and complex strategy of the options discussed so far, but it also provides the quickest recovery time and minimal data loss compared to the previous methods.
- Multi-Region Active/Active: This is where systems/applications are distributed across multiple geographic regions. If one region fails, traffic is automatically redirected to other healthy regions. This is the most complex and expensive of all the strategies, but it also provides the highest level of resilience and availability, protecting mission-critical applications even against full-region outages.

Figure 1 – Disaster recovery strategies sorted by complexity and RTO/RPO
» Why use Terraform with your DR strategy?
If you have gone through the process of selecting and using DR tooling in the past, you most likely encountered one, or more, of the following problems:
- Cost: As I previously mentioned, disaster recovery tools can be extremely expensive. Licensing fees coupled with ongoing costs of maintaining redundant, idle infrastructure can be a significant strain on IT budgets.
- Lack of flexibility: DR toolsets are typically tied to a particular platform. This results in additional complexity and reduced flexibility when it comes to setting DR strategies across multiple cloud providers. This also applies to leveraging a managed solution from one of the major public clouds. While leveraging a cloud-specific DR solution may be convenient at first, it will limit your options for multi-cloud and hybrid strategies in the future as you expand.
- Performance: These tools can also be slow in terms of recovery speed. Legacy DR solutions typically rely on complex mechanisms that are slow and error-prone, making desired RTOs and RPOs difficult to achieve.
Terraform not only helps solve all of these issues, but also provides several other key advantages when leveraged within your disaster recovery strategy:
- Automation: Terraform allows you to automate the entire infrastructure deployment and recovery process, minimizing the need for manual intervention and greatly reducing risk of human error. This also ensures consistency and repeatability within your DR infrastructure setup.
- Repeatability: With Terraform, you are adopting an infrastructure as code mindset, meaning that you ensure consistent infrastructure configuration across multiple environments by defining your infrastructure once in a codified manner. This mitigates configuration drift and ensures that your DR environment accurately mirrors your production setup.
- Scalability: Terraform enables you to scale your environments as needed with ease, allowing you to test your DR infrastructure plans at scale, ensuring they can handle real-world scenarios.
- Cost efficiency: Terraform allows you to dynamically provision and destroy ephemeral resources as needed, resulting in greatly reduced infrastructure costs as you only pay for the resources utilized during your DR exercise instead of incurring ongoing costs from resources that remain idle most of the time.
- Flexibility: With Terraform being a cloud agnostic solution, you have the ability to not only spin up infrastructure in different availability zones or regions within a single cloud provider, but you can provision and manage resources across multiple cloud providers as well.
» How to use Terraform with your DR strategy
Let’s revisit the DR strategies mentioned previously and look at examples of how Terraform can be used with each one:
- Backup & Data Recovery: The -refresh-only flag (used with terraform plan or terraform apply) updates the Terraform state file to match the actual infrastructure state without modifying the infrastructure itself. This can be run after a backup or recovery operation in order to sync Terraform state and reduce drift.
- Pilot Light and Active/Passive: Terraform conditional expressions can be leveraged to deploy only the required infrastructure components needed for a Pilot Light while keeping other resources in a dormant state, or label an Active/Passive configuration as on/off until a DR event occurs. Once a DR event occurs, conditionals can trigger resource scaling to full production capacity, ensuring minimal downtime and operational impact. The next section of this post shows an example of this Active/Passive cutover.
- Multi-Region Active/Active: Terraform modules can be used to encapsulate and reuse infrastructure components. This plays a crucial role in maintaining consistency in large-scale, multi-region environments while simplifying infrastructure management through a single source of truth for your infrastructure code. As an example, you can parameterize your modules by region, ensuring you deploy the same infrastructure across multiple regions:
# Terraform modules parameterized by region
module "vpc" {
  source = "./modules/vpc"
  region = var.region
}

module "compute" {
  source         = "./modules/compute"
  region         = var.region
  instance_count = var.instance_count
}
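Building on the snippet above, a multi-region deployment could instantiate the same module once per region. The provider aliases and region names below are assumptions for illustration and would need to match your own provider configuration:

```hcl
# Hypothetical sketch: the same VPC module deployed to two regions via
# provider aliases (assumed to be declared in the root module)
module "vpc_us_east_1" {
  source = "./modules/vpc"
  region = "us-east-1"
  providers = {
    aws = aws.use1
  }
}

module "vpc_us_west_2" {
  source = "./modules/vpc"
  region = "us-west-2"
  providers = {
    aws = aws.usw2
  }
}
```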
It is also worth noting that the terraform import command can be a valuable tool within your DR strategy, ensuring existing infrastructure created outside of Terraform is brought under management.
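As a sketch, the state-sync and import steps might look like this from the CLI; the resource address and instance ID below are placeholders:

```shell
# Refresh Terraform state to match reality after an out-of-band restore,
# without changing any infrastructure
terraform apply -refresh-only

# Bring a server restored outside of Terraform under management
# (placeholder resource address and instance ID)
terraform import aws_instance.prod_webserver i-0123456789abcdef0
```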
» Disaster Recovery Active/Passive cutover example
To demonstrate how you can leverage Terraform for your DR strategy, the example below shows how to conduct a complete region failover within AWS for a web server hosted on an Amazon EC2 instance behind Route 53 (Refer to Figure 2 below).
The complete code repository for this example can be found here.
Note: I will be using my own domain already set up as an AWS Route 53 Hosted Zone (andrecfaria.com). If you are following along, this value should be replaced with whatever domain you set up within your Terraform configuration.
In a real-world scenario, your environment typically will be much more robust, most likely including:
- Multiple web servers across several availability zones
- Load balancers sitting in front of the web servers
- Databases in both regions with cross-region replication in place
- And more
However, for simplicity, this example uses only a single EC2 instance.

Figure 2 – Web server hosted on an Amazon EC2 instance behind Route 53
This scenario employs the Active/Passive DR strategy with all of your infrastructure provisioned and managed through Terraform. However, the infrastructure required for a DR failover is provisioned only when you trigger the failover itself, preventing ongoing costs from idle compute instances and other cloud resources. After running terraform apply, you see the following outputs:
Outputs:
current_active_environment = "Production"
dns_record = "test.andrecfaria.com"
production_public_ip = "18.234.86.230"
You can use the dig command to verify that your DNS record points to the production IP address:
$ dig test.andrecfaria.com
; <<>> DiG 9.18.28-0ubuntu0.22.04.1-Ubuntu <<>> test.andrecfaria.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<-
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;test.andrecfaria.com. IN A
;; ANSWER SECTION:
test.andrecfaria.com. 60 IN A 18.234.86.230
;; Query time: 9 msec
;; SERVER: 10.255.255.254#53(10.255.255.254) (UDP)
;; WHEN: Mon Feb 10 16:04:47 EST 2025
;; MSG SIZE rcvd: 65
You can also run a curl command to view the contents of your production webpage:
$ curl "http://test.andrecfaria.com"
<h1>Hello World from Production!</h1>
Looking at the Terraform code, within the variables.tf file you can find the following dr_switchover variable:
variable "dr_switchover" {
  type        = bool
  description = "Flag to control environment switchover (false = Production | true = Disaster Recovery)"
  default     = false
}
This variable is a key component of the DR configuration because it defines whether the Route 53 DNS record points to the production web server (by keeping the default value of false), or whether the record should switch over to the DR web server, creating the required infrastructure resources for the DR failover, by setting its value to true.
This is accomplished by leveraging Terraform’s conditional expressions when setting the records argument within the aws_route53_record resource declaration, as well as the count argument within the DR resources.
# Route53 Record - Conditional based on dr_switchover
resource "aws_route53_record" "test" {
  zone_id = data.aws_route53_zone.selected.zone_id
  name    = "${var.subdomain}.${var.domain_name}"
  type    = "A"
  ttl     = 60
  records = [var.dr_switchover ? aws_instance.dr_webserver[0].public_ip : aws_instance.prod_webserver.public_ip]
}
# Disaster Recovery EC2 Instance
resource "aws_instance" "dr_webserver" {
  count                  = var.dr_switchover ? 1 : 0
  provider               = aws.dr
  ami                    = var.dr_ami_id
  instance_type          = var.instance_type
  key_name               = var.key_name
  vpc_security_group_ids = [aws_security_group.dr_sg.id]

  user_data = <<-EOF
    #!/bin/bash
    sudo yum update -y
    sudo yum install -y nginx
    sudo systemctl start nginx
    sudo systemctl enable nginx
    echo "<h1>Hello World from Disaster Recovery!</h1>" | sudo tee /usr/share/nginx/html/index.html
  EOF

  tags = {
    Name        = "dr-instance"
    Environment = "Disaster Recovery"
  }

  depends_on = [aws_security_group.dr_sg]
}
The only change required to cut over to the DR environment is setting the value of the dr_switchover variable to true:
$ terraform apply -var="dr_switchover=true" -auto-approve
Below are the actions and output that Terraform will display when creating the DR EC2 instance and performing an in-place update to the Route 53 record resource, changing the records argument to point to your DR web server IP address instead of the production IP address:
Terraform will perform the following actions:

  # aws_instance.dr_webserver[0] will be created
  + resource "aws_instance" "dr_webserver" {
      ...
    }

  # aws_route53_record.test will be updated in-place
  ~ resource "aws_route53_record" "test" {
        id      = "Z0441403334ANN7OFVRF1_test.andrecfaria.com_A"
        name    = "test.andrecfaria.com"
      ~ records = [
          - "18.234.86.230",
        ] -> (known after apply)
        # (7 unchanged attributes hidden)
    }

Plan: 1 to add, 1 to change, 0 to destroy.
Changes to Outputs:
~ current_active_environment = "Production" -> "Disaster Recovery"
+ dr_public_ip = (known after apply)
Outputs:
current_active_environment = "Disaster Recovery"
dns_record = "test.andrecfaria.com"
dr_public_ip = "54.219.217.97"
production_public_ip = "18.234.86.230"
Once the Terraform run is complete, you can validate that the DNS record now points to the DR web server by using the same dig and curl commands as before:
# dig command results showing the DR IP address
$ dig test.andrecfaria.com
; <<>> DiG 9.18.28-0ubuntu0.22.04.1-Ubuntu <<>> test.andrecfaria.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<-
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;test.andrecfaria.com. IN A
;; ANSWER SECTION:
test.andrecfaria.com. 60 IN A 54.219.217.97
;; Query time: 19 msec
;; SERVER: 10.255.255.254#53(10.255.255.254) (UDP)
;; WHEN: Mon Feb 10 16:16:25 EST 2025
;; MSG SIZE rcvd: 65
# curl command showcasing DR webpage contents
$ curl "http://test.andrecfaria.com"
<h1>Hello World from Disaster Recovery!</h1>
Finally, we can fail back to production by simply running the terraform apply command again, this time setting the dr_switchover variable back to false. This also destroys all the infrastructure created when failing over to DR, preventing unnecessary spend on idle resources.
# Setting the dr_switchover variable value via the CLI
$ terraform apply -var="dr_switchover=false" -auto-approve
# Terraform apply run output
Terraform will perform the following actions:

  # aws_instance.dr_webserver[0] will be destroyed
  # (because index [0] is out of range for count)
  - resource "aws_instance" "dr_webserver" {
      ...
    }

  # aws_route53_record.test will be updated in-place
  ~ resource "aws_route53_record" "test" {
        id      = "Z0441403334ANN7OFVRF1_test.andrecfaria.com_A"
        name    = "test.andrecfaria.com"
      ~ records = [
          - "54.219.217.97",
          + "18.234.86.230",
        ]
        # (7 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 1 to destroy.
Changes to Outputs:
~ current_active_environment = "Disaster Recovery" -> "Production"
- dr_public_ip = "54.219.217.97" -> null
» Cleanup
If you have been following along by deploying your own resources, don’t forget to run the terraform destroy command to clean up your environment and avoid incurring unwanted costs.
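For reference, the teardown is a single command; the -auto-approve flag (used the same way as in the earlier apply runs) skips the confirmation prompt:

```shell
# Destroy all resources managed by this configuration
terraform destroy -auto-approve
```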
» Other considerations
Some additional considerations to be mindful of when using Terraform for DR infrastructure provisioning include, but are not limited to:
- Application install time: Applications not managed by Terraform can take additional time to install and configure when performing a DR failover. Ensure that this is accounted for when determining RTO.
- DNS propagation time: Keep in mind that DNS changes can take time to propagate. For a planned failover, this can be mitigated by proactively lowering the time-to-live (TTL) values of your DNS records a few days in advance.
- Backups: Terraform does not backup your data and is not a replacement for your backup systems. Ensure that you have a solid backup strategy in place that meets your requirements in addition to your DR strategy.
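For a planned failover, that TTL reduction can itself be managed through Terraform. A minimal sketch, assuming the record’s ttl argument is wired to this variable (the variable name and values are illustrative):

```hcl
# Illustrative: drive the Route 53 record TTL from a variable so it can be
# lowered ahead of a planned failover and raised again afterward
variable "record_ttl" {
  type        = number
  description = "TTL for the DNS record, in seconds"
  default     = 3600 # lower to e.g. 60 a few days before a planned cutover
}
```

The aws_route53_record resource would then set ttl = var.record_ttl, so the change rolls out with a normal terraform apply.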
» Conclusion
This blog post demonstrated how Terraform can be leveraged to automate, simplify, and reduce costs related to provisioning and managing infrastructure within your disaster recovery strategy. To learn more about Terraform, visit the HashiCorp developer portal, where you can find more information regarding best practices, integrations, and reference architectures.