Infrastructure as Code: the 5 Questions to Ask before You Start

According to Wikipedia, infrastructure as code (IaC) is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools.

Although I don’t think I can define it any better, I believe that “as code” implies slightly more than those definition files. Over the years, we as developers have accumulated so many best practices and principles for dealing with code that there is no reason we can’t apply that experience to infrastructure once we treat it as code.

Question 0: Should I consider Infrastructure as Code?

Consider this a bonus question, but an important one before embarking on any path towards IaC. Let’s assume that if you are reading this article, you are either considering IaC or, more likely than not, already starting your journey.

Infrastructure as Code (IaC) is an inevitable phase of the DevOps movement. Automation is a cornerstone of DevOps culture [1], and once you have automated your codebase with version control and CI/CD (or at least CI) pipelines, manually maneuvering the underlying infrastructure feels out of place. People may call it by different names, such as playbooks, cookbooks, manifests, and templates, but they are all forms of infrastructure code.

One common objection to Infrastructure as Code (or perhaps any automation) is "We don’t make changes often enough to justify automating them" [2].


Intuitively, it may feel true that changes happen much less often at the infrastructure level than at the code level.

But wait a moment, have you considered configuration changes? How about security patches (like updating to the latest AMI if you are using AWS)? Scaling up your cluster before Black Friday? Have you ever restored your database from a snapshot?

These are all changes related to infrastructure and can be made easier with IaC.

Even when infrastructure changes are infrequent, a stack created via IaC brings many additional benefits. For example, it takes much less effort to fully control your resources (e.g. cleaning them up completely or spinning them up again later), which makes IaC ideal for any POC type of work, especially in cloud environments. I've seen teams unknowingly keep huge orphan EC2 and RDS instances running forever, contributing steady revenue to Amazon, simply because they started them manually and quickly lost track of them.

So from our perspective, the answer to whether you should consider IaC (even if you don’t adopt it) is most definitely yes.

Question 1: Infrastructure as Code or no Infrastructure at all (a.k.a. serverless)?

It is beneficial to have your infrastructure as code. But sometimes less is more, and you don't even need your own infrastructure: your cloud provider takes care of it for you under Function-as-a-Service (FaaS), also known as serverless.

I am a heavy user of AWS Lambda Functions because some tasks are so decoupled from other services that they are literally nothing but functions.

A simple scheduled job that sends notifications to humans or external systems? A good candidate for a serverless function. Of course, you could run a cron job on an EC2 instance, but dedicating an instance to it would be wasteful (in both resources and maintenance burden), and sharing an instance with many other unrelated jobs would couple them too tightly.

A filter to route important info via a webhook of your log aggregator to your Slack? Good for serverless too. You can write, test, and even deploy one in an hour and forget about it.

A popular cloud-agnostic choice of tooling is the Serverless Framework. Regarding languages, Go, Python, and Node.js all work very well. C# and Java (or other JVM languages using the Java runtime) are less suitable for AWS Lambda because of the infamous cold starts [3].

My own rule of thumb: if I can handle a task with no more than 10 files of source code (excluding external libraries) in Python or Node.js, I consider using FaaS. Avoid being too obsessed, though: the tendency to split services small enough to fit into functions can drive microservices into nanoservices, which many consider an anti-pattern [4].

All that being said, investing in serverless options for a subset of your services doesn't contradict the codification of your infrastructure in the big picture. You can still treat a not-so-standalone serverless function as part of your infrastructure, connected to the other parts of your system by API calls or streaming (SQS, Kinesis, Kafka, etc.), as the sketch below illustrates.
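
To make that concrete, here is a minimal Terraform sketch of the scheduled notification job from earlier, managed as part of a regular IaC stack. Treat it as an illustration, not a drop-in config: the function name, package file, and schedule are hypothetical, and the execution role is assumed to be defined elsewhere in the stack.

# Hypothetical scheduled notifier, provisioned alongside the rest of the stack.
resource "aws_lambda_function" "notifier" {
  function_name = "scheduled-notifier"
  runtime       = "python3.8"
  handler       = "notifier.handler"
  filename      = "notifier.zip"            # pre-built deployment package
  role          = aws_iam_role.notifier.arn # execution role defined elsewhere
}

# Trigger the function once a day via a CloudWatch Events rule.
resource "aws_cloudwatch_event_rule" "daily" {
  name                = "notifier-daily"
  schedule_expression = "rate(1 day)"
}

resource "aws_cloudwatch_event_target" "daily_notifier" {
  rule = aws_cloudwatch_event_rule.daily.name
  arn  = aws_lambda_function.notifier.arn
}

# Allow CloudWatch Events to invoke the function.
resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.notifier.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.daily.arn
}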

Question 2: To Terraform or not to Terraform?

Among Infrastructure as Code tools, Terraform's popularity has undeniably grown in the last few years. Two of the four teams I most recently worked with used Terraform as their primary tool (combined with other tools such as Kubernetes), one used plain CloudFormation, and another went completely serverless. To decide whether Terraform is the right choice for your infrastructure, let's start with some comparisons.

Terraform vs. Cloud-specific Options

Every major cloud provider has invented its own template format that you can use to draft your infrastructure code, such as ARM templates from Azure and CloudFormation from AWS.

A very obvious advantage of using these formats is that no one can be more familiar with a cloud than the people developing it, so you can expect those templates to reflect all of the latest features the cloud provider has added or is planning to add. With cloud-agnostic options, you sometimes have to wait for the community to come up with a way to provision something offered by only one specific cloud provider. Tools with a larger community and ecosystem, like Terraform, tend to react faster than those with a smaller one.

On the other hand, using the template format of a particular cloud provider inevitably results in vendor lock-in. You can't run your CloudFormation on GCP or Azure, and it is usually harder than it looks to translate between templates.

Using cloud-agnostic tools like Terraform makes it theoretically easier to switch among cloud providers. Some companies run stacks on multiple clouds with intentional data redundancy in each because, in reality, you can't escape vendor lock-in if one vendor holds all your data.

Besides the concern of vendor lock-in, Terraform is also useful for a polycloud strategy, which is slightly different from multi-cloud: polycloud implies leveraging components from different cloud providers in the same infrastructure stack. As an extreme case, imagine a GCP Compute Engine instance running an Azure Machine Learning task and saving the result into AWS S3. Using Terraform with three providers configured is a much saner choice than writing one template for each cloud.
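
A rough sketch of what such a polycloud root module could look like; the regions, project ID, and bucket name are placeholders, and each provider still needs its own credentials configured:

provider "aws" {
  region = "us-east-1"
}

provider "google" {
  project = "my-ml-project" # hypothetical GCP project ID
  region  = "us-central1"
}

provider "azurerm" {
  features {} # required (empty) block for the azurerm provider
}

# Resources from all three clouds can now live in the same stack,
# e.g. the S3 bucket that receives the ML results.
resource "aws_s3_bucket" "ml_results" {
  bucket = "my-ml-results" # hypothetical bucket name
}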

Terraform vs. Other Cloud-agnostic Options

Excluding the cloud-specific options still leaves plenty of choices. Big players include Chef, Puppet, and Ansible (though some may argue they are more server configuration tools than infrastructure provisioning tools [2]), and my favorite is Kubernetes (commonly stylized as K8s).

Kubernetes is incredibly powerful and has an even larger ecosystem than Terraform. However, it might be unfair to compare them, because Kubernetes is a container orchestration system that happens to meet the needs of Infrastructure as Code, while Terraform is scoped much more narrowly.

In Kubernetes, you define all the resources in your infrastructure via manifest files (usually YAML), and the control plane manages them for you, almost magically. Within an established K8s cluster, you need to know even fewer details about the underlying hardware than with Terraform to provision new stacks. And you are much less likely to mess things up if role-based access control (RBAC) is configured correctly in that cluster.

However, setting up a Kubernetes cluster, especially for the first time, can be very hard. It is easier if your cloud provider offers managed options such as AWS EKS or GCP GKE, but it is still pretty involved to get everything configured properly.

OpenShift is a similar option to Kubernetes; in fact, it runs K8s underneath and provides an extra abstraction layer. Some may find it useful; for me, vanilla K8s provides just the right level of abstraction, and yet another layer feels a little redundant and doesn't justify the new moving parts it introduces.

Besides these established options, many newer, smaller players have emerged that may interest you. Pulumi, for example, enables you to write your infrastructure code in the same languages you would usually use for application code, such as Python, JS, TS, and Go, in case you are not a fan of the declarative DSL model used by most tools.

Other Pros and Cons of Terraform

The module system is a major strength of Terraform that doesn't exist natively in many cloud-agnostic or cloud-specific options. Not only does it dramatically increase the reusability of your infrastructure code, but you also get the invaluable opportunity to distill the differences between environments into just a couple of files.

.
├── modules
│   ├── backend
│   │   ├── main.tf
│   │   ├── user_data.sh
│   │   └── variables.tf
│   ├── documentDB
│   │   ├── main.tf
│   │   └── variables.tf
│   ├── frontend
│   │   ├── install_docker_compose.sh
│   │   ├── main.tf
│   │   ├── routings.tf
│   │   └── variables.tf
│   ├── grafana
│   │   ├── main.tf
│   │   └── variables.tf
│   ├── postgreSQL
│   │   ├── main.tf
│   │   └── variables.tf
│   └── redis
│       ├── main.tf
│       └── variables.tf
├── production
│   ├── main.tf
│   └── variables.tf
├── qa
│   ├── main.tf
│   └── variables.tf
└── staging
    ├── main.tf
    └── variables.tf

In the above example, all the configuration differences between environments are strictly scoped to the stack-level variables.tf files, and those stay very concise if your module-level variables.tf files have sensible defaults.
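
For illustration, qa/main.tf under the layout above might contain little more than module blocks; the input variable names here are hypothetical:

module "backend" {
  source        = "../modules/backend"
  environment   = "qa"
  instance_type = var.backend_instance_type # set in qa/variables.tf
}

module "postgreSQL" {
  source      = "../modules/postgreSQL"
  environment = "qa"
  db_size     = var.db_size # small default for non-production stacks
}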

The biggest gripe about working with Terraform is the fragility of its state file, terraform.tfstate by default. It is understandable that state must be persisted somewhere to manage any dynamic infrastructure, but many tools abstract that trouble away from you. For example, CloudFormation provides drift detection as a service, through the console or API, and Kubernetes strives to be as stateless as possible.

Terraform leaves the duty of safekeeping the stack's state to you, the applier of the infrastructure code. There are a few different ways of managing the state file, but the last thing you want is to keep it publicly in your repo, because anyone can open the file in a text editor and peek at the plaintext secrets inside. For AWS environments, consider using the "s3" backend to store the state file: it requires a properly secured private S3 bucket and uses a DynamoDB table for locking (the same bucket and lock table can serve multiple stacks).
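
A typical "s3" backend block looks roughly like this; the bucket, key, and table names are placeholders:

terraform {
  backend "s3" {
    bucket         = "my-terraform-states"        # private, access-controlled bucket
    key            = "my-stack/terraform.tfstate" # one key per stack
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"            # state locking and consistency checks
  }
}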

I highly recommend enabling versioning on the cloud storage service containing your state file, so you can restore a corrupted state file from the last working version. If you can't enable it, or you store your state file with a local backend, back up the state file before applying anything nontrivial.
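
If you manage the state bucket and lock table themselves with Terraform (they are often created once, out of band), enabling versioning is a one-liner. A sketch with placeholder names, using the inline versioning block from the pre-4.0 AWS provider:

resource "aws_s3_bucket" "tf_states" {
  bucket = "my-terraform-states"
  versioning {
    enabled = true # keeps older state versions restorable
  }
}

resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the key name Terraform's s3 backend expects
  attribute {
    name = "LockID"
    type = "S"
  }
}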

Another not-so-obvious issue with Terraform is version compatibility. Terraform has not yet reached 1.0 at the time of writing, more than six years after its initial release. Being sub-1.0 implies, on the bright side, that it's still rapidly evolving, but it also means each version can break backward compatibility and be forgiven. HashiCorp does provide helper commands and tools to migrate, but in my experience upgrading many stacks from 0.11 to 0.12 or from 0.12 to 0.13, there is always a chance of missing something and corrupting your state file. Again, do keep a backup of your state file! With Kubernetes, I've had far fewer issues across minor version upgrades.

Question 3: When or how frequently should I apply my IaC changes?

After you finish coding your infrastructure, you will need to apply it to the real world to instantiate your stack. You can choose how to apply it based on how soon and how often changes need to be made.

The fastest way to create your infrastructure from code, or to apply changes, is to run it manually from the CLI. Usually the only other thing you need is the right access to your cloud provider. For AWS, that could be an STS token to assume an admin role, or an ID/secret pair in the .aws folder or in environment variables.

If the idea of creating stacks directly from your laptop scares you (or your security team), you can run the same commands after SSHing into a dedicated host with preset system roles granting just enough permission to manage your piece of infrastructure. This manual approach gives you more flexibility in the timing of stack changes, with the drawback of being noncompliant in many companies, unless it's just a POC project.

A more widely accepted approach is to embed your infrastructure changes into a CI pipeline. There are lots of plugins pairing CI platforms with IaC tools, such as the Terraform plugin for Jenkins. But more often than not, you don't even need a plugin to apply your infrastructure code in CI. In GitLab CI, for example, I only need to set the job image to hashicorp/terraform, with the image tag matching my scripts, to run any Terraform commands I would usually run in my CLI.

There are also third-party tools that apply your changes for you. Atlantis, for example, applies your changes when a pull request containing changes to Terraform files is opened on GitHub or GitLab. Using the pull request page as the main interaction point is an increasingly popular choice for many modern tools (e.g. the static code analysis platform MuseDev) because it is one of the few points in the development lifecycle that draws the attention of the entire team.

Regardless of your choice, there is always the risk of building your stack into something monolithic, which slows you down in the long run. It is almost always helpful to make your changes small, modular, and independent of each other. This is especially true if you agree with the "treat your servers like cattle, not pets" mindset [5], which means the majority of your system should be disposable and reproducible.

For example, if you use Terraform, the concept of modules discussed above would be your friend. You may want to start creating modules early and keep your root-level main.tf file as small as possible: ideally only a set of module blocks besides the mandatory terraform and provider blocks, as sketched earlier. Moreover, considering the risk of state file corruption mentioned earlier, you may want to avoid having too many modules in the same stack sharing the same tfstate file.

Want to be even more confident to apply your change as frequently as you want? Consider testing your Infrastructure code, which is the next question to ask.

Question 4: How to Test my Infrastructure Code?

Test in Production

When you catch a colleague testing his code in production, will you recall this famous "I test in production" meme and laugh at him?

Well, how about when his code is in the context of "Infrastructure as Code"? Does your reaction change to "Oh, testing THAT in production is OK"? But is infrastructure code really so different from the application code we are familiar with that it is not testable in lower environments?

To answer that, let's first admit that infrastructure code IS harder to test. For one thing, you can hardly create a miniature version of the full infrastructure stack on your laptop, so local testing is not always feasible. For another, mocking upstream infrastructure is much harder than mocking an upstream service in your application code, because of the difference in the level of abstraction.

Fortunately, being unable to test something locally doesn't mean you have to test it in production. You should still have at least one environment lower than production, whether you call it testing, QA, staging, or something else. Deploying to a testing environment is itself a test of your infrastructure code.

Assertions make tests possible, and for each IaC tool there are companion tools that verify the actual end state of an infrastructure change against the desired one. For Terraform there is Terratest, for CloudFormation there is TaskCat, and InSpec works with Chef. Even if you don't use these assertion tools, simply running the integration tests you prepared for your application code will expose many potential problems in your infrastructure code. If you use CI/CD to apply your infrastructure changes, put the pipeline steps that apply changes to the lower environment AND run integration tests before the step that changes the production infrastructure.

However, testing and assertions in a lower environment only make sense when you run the same infrastructure code in your production environment. Take some time to modularize and reuse the infrastructure code shared among environments instead of copy/pasting it around.

Sometimes, instead of full-fledged assertive testing, static code analysis can be a much more pragmatic target to automate. And yes, you can do static code analysis on your infrastructure. Most IaC tools include CLI commands to validate and lint your infrastructure code out of the box (e.g. terraform validate and terraform fmt), and third-party tools go beyond that to find certain misconfigurations and security issues as well. Below is an example of Checkov scanning an IaC stack coded in Terraform, triggered via MuseDev.

(Screenshot: Checkov scan results surfaced through MuseDev.)

Question 5: Will IaC mitigate my Tech Debt?

Infrastructure as Code by itself is just a tool, like the cloud, DevOps, or AI; it will not magically solve your problems. You have to decide whether it is the right tool for your particular fix.

We usually consider technical debt a combination of design debt and code debt [6], but here let's take each level apart and evaluate its relationship with IaC:

IaC and Design Debt

Here we limit the term "design" to the architectural level, rather than the application level.

Although in an ideal world a good architect should defer all infrastructure-level decisions as long as possible [7] to avoid premature constraints, the rubber has to meet the road eventually. To materialize the system, you have to decide on the exact type of DB, queue, server, logging, monitoring, alerting, and everything else you initially abstracted away from the developers of the application code. And some of those decisions are bound to be wrong the first time, however experienced you are, because of the lack of information that is only collected after the initial decisions are made.

You will want to be able to refactor your infrastructure, continuously in some cases, after you start to get more information. You will want to be agile. And this is when you reap the benefits of Infrastructure as Code.

In places where IaC was not adopted at all, or was implemented incompletely as an afterthought, I've seen infrastructure such as application servers drift so far from their original state that no one dares to change them. Some call these "snowflake systems" (not to be confused with the Snowflake data warehouse) because they are so unique that no one can reproduce them [2]. With such systems around, any attempt to refactor your code either fails or gets dramatically slowed down, falling into the automation fear spiral [8].

Conversely, a proper implementation of IaC results in disposable and replaceable infrastructure modules that let you quickly troubleshoot and resolve failures. The effort and risk of swapping new components in and old ones out is finally reduced to a level where you can experiment with and refactor the details of your architectural design. And this is how IaC helps you pay down your tech debt.

IaC and Code Debt

When you talk about spaghetti, smelly, inflexible, or legacy code, you know you are talking about the code debt type of tech debt.

Unfortunately, Infrastructure as Code, however good it is, will not help you refactor your application code. In fact, these two types of code shouldn't even know the details of each other.

Ideally, thanks to the power of the cloud and containerization, your application layer should be decoupled from the runtime platform hosting it, and the runtime platform should be even further decoupled from the infrastructure platform supporting it. Your application code and infrastructure code should be in parallel dimensions.

As a result, you still need your good old red/green/refactor cycles to gradually improve the quality of your application code. What's more, you may want to start refactoring your infrastructure code as well from now on, since it is not only "real" code, but we also have ways to test it (see Question 4).

Fortunately, nowadays we have plenty of tools to help us fight code debt. For example, static code analysis tools with a properly configured ruleset can give you actionable refactoring suggestions for source code in almost any language [9], and modern Continuous Assurance platforms such as MuseDev are all you need to integrate those tools into your own repo.

Summary or TLDR

  • Q0: Should I consider Infrastructure as Code?

    Suggestion: Yes, if you have any cloud environment, whether public, hybrid, or private.

  • Q1: Infrastructure as Code or Serverless?

    Suggestion: For small standalone services, try serverless. For complex architecture, IaC or both.

  • Q2: To Terraform or not to Terraform?

    Suggestion: Use Terraform if you worry about vendor lock-in. Be aware of other options (such as K8s) if you worry about getting locked into Terraform itself.

  • Q3: When or how frequently should I apply my IaC changes?

    Suggestion: The smaller the units of your infrastructure stack are, the more frequently you can change them.

  • Q4: How to Test my Infrastructure Code?

    Suggestion: Lint your infrastructure code before applying your changes. Optionally use assertions to verify the deployment.

  • Q5: Will IaC mitigate my Tech Debt?

    Suggestion: IaC helps you fix your design debt, but you need other tools to fix your code debt.

[1] DevOpsCulture
[2] Infrastructure as Code, 2nd Edition - Kief Morris
[3] New for AWS Lambda – Predictable start-up times with Provisioned Concurrency | AWS Compute Blog
[4] Microservices, Nanoservices, Teraservices, and Serverless - DZone Microservices
[5] The History of Pets vs Cattle and How to Use the Analogy Properly | Cloudscaling
[6] Technical debt - Wikipedia
[7] A Little Architecture
[8] Infrastructure as Code: The Automation Fear Spiral | ThoughtWorks
[9] List of tools for static code analysis - Wikipedia
