IaC Issues I Have Encountered

#terraform #aws #cdk

IaC greatly simplifies the process of configuring, creating, and destroying infrastructure, while allowing rapid deployment of resources to the cloud. However, whether in hands-on practice or during technical discussions, a wide variety of challenges often emerge.

Preface: The Difference Between IaC and Direct API Calls

Many people ask me, or argue with me, why they should use IaC when creating cloud resources directly via API calls seems easier and more convenient.
The key difference is that IaC introduces state management: the state of your infrastructure is recorded in a known location, so it can be inspected and tracked.
For example, Terraform stores this state in a state file, whereas CloudFormation keeps it on the service side as part of each stack.
Some would argue that if you use Kubernetes (K8s), you don't need state management at all. In reality, K8s manages the state for you. Think about what happens when you update a Deployment: K8s compares the new spec with what is currently running and updates the resources accordingly.
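To make the idea concrete, here is a minimal sketch of what "state management" buys you: given a recorded state and a desired configuration, a tool can compute exactly what to create, update, or delete. The resource names and the dictionary shape are hypothetical and do not mirror any real tool's state format.

```python
from typing import Dict, List, Tuple

# Hypothetical shape: resource name -> properties
Resources = Dict[str, dict]


def plan(recorded: Resources, desired: Resources) -> List[Tuple[str, str]]:
    """Compare the recorded state with the desired configuration."""
    actions: List[Tuple[str, str]] = []
    # In the config but not in the state: needs to be created
    for name in sorted(desired.keys() - recorded.keys()):
        actions.append(("create", name))
    # In both, but with different properties: needs an update
    for name in sorted(recorded.keys() & desired.keys()):
        if recorded[name] != desired[name]:
            actions.append(("update", name))
    # In the state but no longer in the config: needs to be deleted
    for name in sorted(recorded.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


recorded_state = {"bucket-logs": {"versioning": False}, "queue-jobs": {"fifo": True}}
desired_config = {"bucket-logs": {"versioning": True}, "topic-alerts": {}}

print(plan(recorded_state, desired_config))
# [('create', 'topic-alerts'), ('update', 'bucket-logs'), ('delete', 'queue-jobs')]
```

Without the recorded state, a script making direct API calls has no way of knowing that queue-jobs should be removed.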

The Realities of IaC State Management

1. Property Changes Leading to Resource Deletion

This is a common and longstanding issue. Due to the nature of some cloud provider APIs, certain properties that can be modified directly in the console may require deletion and re-creation when managed via IaC.
This means we need better testing and alerting around changes.
If you're using AWS CDK, you can review the diff (for example with cdk diff) or inspect the CloudFormation change set before it executes, and write tests that validate the expected outcome.
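On the Terraform side, one way to add such an alert is to scan the machine-readable plan for resources that would be deleted and re-created. The sketch below assumes the plan was exported with terraform show -json and relies only on the resource_changes, address, and change.actions fields; the sample data is hand-written for illustration, not real output.

```python
import json
from typing import List


def find_replacements(plan_json: str) -> List[str]:
    """Return addresses of resources the plan would delete and re-create."""
    plan = json.loads(plan_json)
    replaced = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as both a delete and a create action
        if "delete" in actions and "create" in actions:
            replaced.append(rc["address"])
    return replaced


# Hand-written sample, trimmed to the fields used above
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
})

print(find_replacements(sample))  # ['aws_db_instance.main']
```

A CI job could fail, or ask for manual approval, whenever this list is not empty.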

Takeaway: Always test thoroughly before changes and review the change diff carefully.

2. Drift Management

Where there is state management, there will be drift.
Drift is the inconsistency between what your IaC code (and its recorded state) describes and the actual infrastructure, even when that infrastructure was originally created from that very code.

Example:
Colleague A uses CloudFormation to create several resources. Later, Colleague B deletes one specific S3 bucket directly from the AWS Console. This introduces drift.

Different tools handle drift differently. For instance, terraform plan refreshes the state against the real infrastructure before computing a diff, and terraform plan -refresh-only reports drift on its own (it replaces the older, now-deprecated terraform refresh command). But ultimately, the best way to prevent drift is to avoid modifying resources outside the IaC workflow.
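Conceptually, drift detection boils down to comparing what the state says with what actually exists. The following is a simplified sketch of that comparison, mirroring the deleted-bucket example; the resource names and properties are made up, and real tools query the provider APIs rather than a dictionary.

```python
from typing import Dict, List


def detect_drift(recorded: Dict[str, dict], actual: Dict[str, dict]) -> List[str]:
    """List differences between the recorded state and the real resources."""
    findings = []
    for name, props in sorted(recorded.items()):
        if name not in actual:
            findings.append(f"{name}: missing (deleted outside IaC)")
            continue
        for key, value in sorted(props.items()):
            current = actual[name].get(key)
            if current != value:
                findings.append(f"{name}.{key}: recorded {value!r}, actual {current!r}")
    return findings


# What the IaC tool believes exists (hypothetical resources)
recorded = {
    "assets-bucket": {"versioning": True},
    "jobs-queue": {"visibility_timeout": 30},
}
# What is really there after someone used the console
actual = {
    "jobs-queue": {"visibility_timeout": 60},
}

for finding in detect_drift(recorded, actual):
    print(finding)
# assets-bucket: missing (deleted outside IaC)
# jobs-queue.visibility_timeout: recorded 30, actual 60
```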

Takeaway: Avoid making changes through the console or APIs. Ensure all modifications go through your IaC code.

3. Resource Dependencies and Deployment Order

Here’s a small story:
I once had a task to use Terraform to create a server and deploy a Grafana service. Sounds straightforward. But eventually, I had to completely restructure the directory.

Why?
Because I built the code from scratch and added resources incrementally. After setting up Grafana, I used the Grafana provider to import dashboards. However, when I tore everything down and redeployed from zero, things broke.
Terraform orders resources by the references between them, but a provider's configuration must be known before that provider's resources can be planned or applied. The Grafana provider needed a valid endpoint and token, and on a fresh deployment neither existed until the server was up.
The solution was to separate the infrastructure into two stages: first create Grafana, then apply dashboards.
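The ordering problem can be pictured as a dependency graph. This sketch uses Python's standard graphlib module to group hypothetical components into stages; the component names are illustrative and are not the actual resources from my project. However the stages are implemented, the point is that dashboards can only come after the token exists.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical components; each maps to the components it depends on
dependencies: Dict[str, Set[str]] = {
    "server": set(),
    "dns_record": {"server"},
    "grafana_service": {"server"},
    "grafana_api_token": {"grafana_service"},
    "dashboards": {"grafana_api_token"},
}


def deployment_stages(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group components into stages; items in one stage can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


for number, stage in enumerate(deployment_stages(dependencies), start=1):
    print(number, stage)
# 1 ['server']
# 2 ['dns_record', 'grafana_service']
# 3 ['grafana_api_token']
# 4 ['dashboards']
```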

Another example:
I used CDK to build and push a container image. But the Lambda function that referenced the image wasn’t automatically updated. And since the image repository had a lifecycle policy that retained only the latest N images, multiple runs led to the Lambda pointing to a now-deleted image.
Worse yet, Lambda didn't fail immediately. It failed later, which made the issue difficult to detect.
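A small check run after each deployment could have caught this early. The sketch below models a "keep the latest N images" rule and tests whether the image a function points at would survive it. The digests are placeholders, and in a real pipeline the two inputs would come from the registry and the function configuration rather than hard-coded lists.

```python
from typing import List


def retained_images(pushed: List[str], keep_last: int) -> List[str]:
    """Images that survive a 'keep the latest N' rule; input is oldest first."""
    return pushed[-keep_last:] if keep_last > 0 else []


def is_dangling(function_image: str, pushed: List[str], keep_last: int) -> bool:
    """True if the function references an image the policy would have removed."""
    return function_image not in retained_images(pushed, keep_last)


# Placeholder digests, oldest first
pushed = ["sha256:aaa", "sha256:bbb", "sha256:ccc", "sha256:ddd"]

print(is_dangling("sha256:aaa", pushed, keep_last=3))  # True
print(is_dangling("sha256:ddd", pushed, keep_last=3))  # False
```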

Takeaway: Resource dependencies and execution order matter greatly in IaC. Always document clearly or use CI/CD tools to orchestrate deployment steps properly.

4. The Trouble with Custom Scripts

Not everything can be accomplished using the built-in capabilities of IaC tools. Both AWS CDK and Terraform offer mechanisms for executing custom scripts or direct API calls to handle such cases.

However, these custom scripts often lack proper lifecycle management. They’re much harder to track and clean up compared to native constructs. In some cases, they can't be destroyed or recreated automatically, leading to inconsistencies.
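When a custom script is unavoidable, it helps to give it an explicit lifecycle. CloudFormation custom resources receive a RequestType of Create, Update, or Delete, and a handler that covers all three can at least be cleaned up on destroy. The sketch below is illustrative only: the in-memory dictionary stands in for whatever external system the script manages, the property names are hypothetical, and the returned shape is an assumption based on the CDK provider framework's handler convention.

```python
from typing import Dict

# In-memory stand-in for the external system the script manages (hypothetical)
EXTERNAL_SYSTEM: Dict[str, dict] = {}


def on_event(event: dict) -> dict:
    """Handle one lifecycle event for a custom resource."""
    request_type = event["RequestType"]
    props = event.get("ResourceProperties", {})

    if request_type == "Create":
        physical_id = f"dashboard-{props['Name']}"
        EXTERNAL_SYSTEM[physical_id] = dict(props)
        return {"PhysicalResourceId": physical_id}

    # Update and Delete events carry the ID returned by Create
    physical_id = event["PhysicalResourceId"]
    if request_type == "Update":
        EXTERNAL_SYSTEM[physical_id] = dict(props)
        return {"PhysicalResourceId": physical_id}
    if request_type == "Delete":
        # pop with a default keeps the delete idempotent
        EXTERNAL_SYSTEM.pop(physical_id, None)
        return {"PhysicalResourceId": physical_id}

    raise ValueError(f"Unexpected RequestType: {request_type}")


created = on_event({"RequestType": "Create", "ResourceProperties": {"Name": "latency"}})
on_event({"RequestType": "Delete", "PhysicalResourceId": created["PhysicalResourceId"]})
print(EXTERNAL_SYSTEM)  # {}
```

A script that only implements the Create branch is exactly the kind that leaves resources behind.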

Takeaway: Avoid overusing custom scripts. Stick with built-in constructs whenever possible.

Summary
Infrastructure as Code (IaC) provides powerful advantages for managing cloud resources at scale, including automation, consistency, and repeatability. However, effective use of IaC also comes with challenges—such as managing resource state, preventing drift, handling complex dependencies, and avoiding the pitfalls of custom scripting. This article explores practical issues commonly encountered in real-world IaC implementations and offers actionable recommendations to help teams build more reliable and maintainable infrastructure workflows.