Intro
As reality hits, you face the unavoidable fact of dealing with a hard-to-manage Terraform big ball of mud. There is no way around the natural growth and evolution of a code base and the design flaws that come with it. Our Agile mindset is to “move fast and break things”: implement something as simply as possible and leave the design decisions for the next iterations (if any).
Refactoring Terraform code is actually as natural as developing it. Time and time again you will face situations where a better structure or organization can be achieved: maybe you want to upgrade from a home-made module to an open-source/community alternative, or maybe you just want to segregate your resources into different states to speed up development. Regardless of the goal, once you get into it, you will realize that Terraform code refactoring is a basic step of the development process that no one told you about before.
As the Suffering-Oriented Programming mantra dictates:
“First make it possible. Then make it beautiful. Then make it fast.”
So, time to make the Terraform code beautiful!
How to break a big ball of mud? STRANGLE IT
<joke>Martin Fowler has already written everything there is to write (since the early 2000s) about DevOps, Agile, and Software Development. Therefore, we could reference Martin Fowler for virtually anything software related</joke>, but really, the Refactoring book is THE reference on this subject.
Martin Fowler shared the Strangler (Fig) Pattern, which describes a strategy to refactor a legacy code base by re-implementing the same features (sometimes even the same bugs) in another application.
[…] the huge strangler figs. They seed in the upper branches of a tree and gradually work their way down the tree until they root in the soil. Over many years they grow into fantastic and beautiful shapes, meanwhile strangling and killing the tree that was their host.
This metaphor struck me as a way of describing a way of doing a rewrite of an important system.
In this document we are going to follow the same idea:
- implement the same feature on a different Terraform composition;
- migrate the Terraform state;
- delete (kill) the previous implementation.
The mono-repository (monorepo) approach to Legacy
Let’s suppose that your Terraform code base is versioned in a single repository (a.k.a. monorepo), following the arbitrary structure displayed below (just to help illustrate):
.
├── modules/     # Definition of TF modules used by underlying compositions
├── global/      # Resources that aren't restricted to one environment
│   └── aws/
├── production/  # Production environment resources
│   └── aws/
└── staging/     # Staging environment resources
    └── aws/
In this example, each directory corresponds to a Terraform state. In order to apply changes, you have to change into a path and execute terraform there.
The structure of this example repository was created a few hypothetical years ago, when the number of existing microservices and resources (databases, message queues, etc.) was significantly smaller. At the time, it was feasible to keep the Terraform definitions together because it was easier to maintain: Cloud resources were managed in one shot!
As time went by, the number of products and the team grew, and engineers started facing concurrency issues: Terraform locks the shared state storage while someone else is running terraform apply, as well as a general slowness on every execution, since the number of data sources to sync is frightening.
A mono-repository approach is not necessarily bad; versioning is actually simpler when performed in one single repository. Ideally, there won’t be many changes on the scale of GiB, meaning that it is safe to keep the monorepo as long as the Terraform remote states are divided.
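Dividing the remote states can be as simple as giving each composition its own backend key. A minimal sketch, assuming an S3 backend with hypothetical bucket and key names:

```hcl
# production/aws/backend.tf (hypothetical names)
terraform {
  backend "s3" {
    bucket = "acme-terraform-states"            # assumption: a shared states bucket
    key    = "production/aws/terraform.tfstate" # one key per composition
    region = "us-east-1"
  }
}
```

With each directory pointing at its own key, the states (and their locks) are independent: a terraform apply in staging/aws no longer blocks production/aws.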
Splitting the modules sub-path into its own repository
One thing to mention, though, is the modules sub-path: it could be stored in a different git repository to leverage its own versioning. Since Terraform modules and their implementations don’t always evolve at the same pace, keeping two distinct version trees is beneficial. Additionally, a separate repository for Terraform modules allows the specification of “pinned versions”, e.g.:
module "aws_main_vpc" {
source = "git::https://github.com/terraform-aws-modules/terraform-aws-vpc.git?ref=2ca733d"
# Note the ref=${GIT_REVISION_DIGEST}
}
That reference to a module’s version should always be specified, regardless of whether it comes from an internal/private repository or a public one. When you pin the version, you are ensuring reproducibility.
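The ref parameter also accepts git tags, so pinning to a release tag works the same way (the tag below is hypothetical):

```hcl
module "aws_main_vpc" {
  # pinned to a release tag instead of a commit digest (hypothetical tag)
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-vpc.git?ref=v3.19.0"
}
```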
Therefore, let’s move the modules sub-path to another git repository, following the instructions from this StackOverflow answer so that the git commit history is preserved:
0. Change into the monorepo path and create a branch from the commits under the monorepo’s modules path
MAIN_BIGGER_REPO=/path/to/the/monorepo
cd "${MAIN_BIGGER_REPO}"
git subtree split -P modules -b refact-modules
1. Create the new repository
mkdir /path/to/the/terraform-modules && cd $_
git init
git pull "${MAIN_BIGGER_REPO}" refact-modules
2. Link the new repository to your remote Git (server) and push the commits
git remote add origin git@git.com:user/terraform-modules.git
git push -u origin master
3. [OPTIONAL] Clean up the history related to modules from $MAIN_BIGGER_REPO
cd "${MAIN_BIGGER_REPO}"
git rm -r modules
git commit -m "Remove modules (moved to its own repository)"
git filter-branch --prune-empty \
    --tree-filter "rm -rf modules" -f HEAD
Let’s start strangling the repository
Now that a substantial piece of code has been moved somewhere else, it is time to put the Strangler (Fig) Pattern into practice.
Move all the existing content as-is to the legacy sub-path, keeping the same repository and change history (commits). This also allows applying the legacy code as it used to be, from one of those paths.
.
└── legacy
    ├── global
    │   └── aws
    ├── production
    │   └── aws
    └── staging
        └── aws
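The move itself is just a git mv, which Git records as a rename, so git log --follow can still walk each file’s history across the move. A toy demonstration (temporary repo, hypothetical layout):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# toy repo with the old layout
git init -q repo && cd repo
git config user.email dev@example.com
git config user.name dev
mkdir -p production/aws staging/aws
echo 'prod' > production/aws/main.tf
echo 'stage' > staging/aws/main.tf
git add . && git commit -qm "old layout"

# move every composition under legacy/
mkdir legacy
git mv production staging legacy/
git commit -qm "move existing compositions under legacy/"

# --follow traverses the rename: both commits show up
git log --follow --oneline -- legacy/production/aws/main.tf
```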
Once the content is moved to legacy, the idea is to follow the Boy Scout rule in order to strangle the legacy content little by little (unless you are really committed to migrating it all at once, which is going to be exhausting).
The Boy Scout rule goes like this:
- every time a task that involves deprecated code appears, we implement it on the new structure;
- import the Terraform state so we keep the Cloud resources that the given code represents/describes;
- remove the state and the code from legacy.

Repeat until there is nothing left inside legacy (or only unused, left-behind resources remain, which could be destroyed/garbage-collected either way).
Import state? Remove state and code from what? Where?
That depends on the kind of resource we are migrating from the remote state. At the bottom of each resource’s page in Terraform’s provider documentation you can find a reference command to import existing resources into your Terraform code specification, e.g. the AWS RDS DB instance.
Suppose we want to replace the code of the AWS RDS Aurora cluster defined in production/aws and re-implement the same using the community module. After creating the corresponding sub-path in the monorepo according to your preference, provisioning the bucket, and initializing the Terraform backend:
1. Implement the definition of the community module github.com/terraform-aws-modules/terraform-aws-rds-aurora with the closest parameters to the existing one, e.g.:
module "aws_aurora_main_cluster" {
source = "terraform-aws-modules/rds-aurora/aws"
version = "~> 5.2"
# ...
}
2. Import the Terraform states from the previous (existing) cluster
# Resources inside a module are addressed with the module. prefix
terraform import 'module.aws_aurora_main_cluster.aws_rds_cluster.this[0]' main-database-name
terraform import 'module.aws_aurora_main_cluster.aws_rds_cluster_instance.this[0]' main-database-instance-name-01
terraform import 'module.aws_aurora_main_cluster.aws_rds_cluster_instance.this[1]' main-database-instance-name-02
# ...
Then, if you haven’t yet and would like to “match reality” between the existing and the specified resource, run terraform plan a few times and adjust the parameters until Terraform reports:
No changes. Your infrastructure matches the configuration.
3. Last but not least, remove the corresponding resources from the legacy Terraform state, so that it doesn’t try to keep track of changes, or try to destroy resources whose definition is no longer in that code base:
# Hypothetical name of the resource inside production/aws/main.tf
terraform state rm aws_rds_cluster.default \
'aws_rds_cluster_instance.default[0]' 'aws_rds_cluster_instance.default[1]'
# ...
Once that is performed, feel free to remove the corresponding resource definitions from the legacy code:
resource "aws_rds_cluster" "default" {
# ...
}
resource "aws_rds_cluster_instance" "default" {
count = var.number_of_database_instances
# ...
}