DEV Community

Shrestha Pandey

Why Terraform Breaks After Day-1 And How Terraform Actions Fix It

Let me start with something most infrastructure engineers won't say out loud: Terraform solves Day-1 beautifully and then kinda leaves you hanging.

You write your HCL, run terraform apply, and everything is provisioned perfectly. The state file is pristine. But six months later that same infrastructure has been poked, patched, and manually changed, silently drifting away from what Terraform thinks exists. Nobody notices until something breaks in production.

This article is about that “gap” between provisioning and actually managing infrastructure across its entire lifetime.

Day-2 Is Where Infrastructure Goes to Die (Slowly)

Picture a full stack provisioned on AWS with Terraform. The state is clean and everything matches. Then some time passes, a deployment fails, and someone logs into the console and changes a security group rule. The deployment now succeeds… but the change is never documented and no ticket is raised.

Then the scheduled terraform apply runs. Terraform sees the difference, resets the security group to its original state, and production breaks. Everyone is confused because no code changed.

The root cause is that the tooling was never designed for this usage; Terraform's core capability has always been infrastructure provisioning.
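Drift like this is at least detectable before it bites. In modern Terraform versions, a refresh-only plan shows what changed outside of code without proposing to revert anything:

```
# Show what has changed outside Terraform, without proposing any changes
terraform plan -refresh-only

# Accept the out-of-band changes into state instead of reverting them
terraform apply -refresh-only
```

Neither command touches real infrastructure; they only reconcile state, which makes them a safe first step when you suspect console changes.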

Therefore, what are teams doing for their Day-2 operations? Most have a combination of:

  • Bash scripts with parts nobody understands
  • Manual AWS Console changes that are never documented
  • Ad-hoc Ansible runs that don't tie back to Terraform state in any way
  • Lambda functions triggering other Lambda functions in chains nobody can trace

Organizations in the field routinely end up with over 30 different tools actively managing a single hybrid infrastructure estate.

The Lifecycle Nobody Talks About Enough

Infrastructure has four phases, and most of the industry focuses heavily on two of them.

The first phase, "Day-0", is the "Build Phase." The organization designs its infrastructure and defines policies, in partnership with platform and security teams. Nothing has been provisioned yet.

The second phase, "Day-1", is the "Deploy Phase." terraform apply runs, infrastructure gets built, and application teams deploy their workloads. This is where Terraform shines.

"Day-2" is the "Manage Phase." Patches are installed, configurations change, certificates are renewed, capacity is scaled, and compliance is checked. Day-2 can last for years, and it is where all of the operational pain lives. Terraform traditionally has no role in this phase.

"Day-N" is the "Decommission Phase," where everything is torn down and cleaned up.

For the last ten years the DevOps industry has focused on perfecting Day-1 tooling; very little exists for Day-2.

Terraform Actions — What Changed in v1.14

Terraform Actions shipped as stable functionality in Terraform v1.14, unveiled at HashiConf 2025. Providers can now expose actions that do more than CRUD: invoking a Lambda function, stopping an EC2 instance, invalidating a CloudFront cache, or triggering an Ansible playbook.

Actions live in their own top-level action block in your HCL. Terraform can run them automatically on trigger events during a resource's lifecycle, or you can invoke them manually from the CLI without a full terraform apply.

You can invoke an operational action (say, calling a Lambda to warm a cache) without Terraform re-evaluating the entire state of your infrastructure. That is a significant shift in how you operate infrastructure.

The AWS provider currently has:

  • aws_lambda_invoke
  • aws_ec2_stop_instance
  • aws_cloudfront_create_invalidation

How Actions Actually Work — The Syntax

There are two pieces. The action block itself, and the trigger that fires it.

Defining an Action

action "aws_lambda_invoke" "warm_cache" {
  config {
    function_name = aws_lambda_function.cache_warmer.function_name
    payload = jsonencode({
      source = "terraform_action"
    })
  }
}

Note the config {} wrapper. Provider-specific arguments go inside config, not directly in the action block.

Meta-arguments like count and provider exist outside config:

action "aws_lambda_invoke" "warm_cache" {
  count    = var.invoke_on_deploy ? 1 : 0
  provider = aws.us_east_1
  config {
    function_name = aws_lambda_function.cache_warmer.function_name
    payload       = jsonencode({ source = "terraform_action" })
  }
}

Triggering an Action on Resource Lifecycle Events

This goes inside the resource's lifecycle block:

resource "aws_lambda_function" "api" {
  function_name = "my-api-handler"
  # ... rest of config

  lifecycle {
    action_trigger {
      events  = [after_create, after_update]
      actions = [action.aws_lambda_invoke.warm_cache]
    }
  }
}

Two main things to understand:

  • events uses unquoted keywords — after_create and after_update
  • actions is plural and takes a list, not a single reference

You can also add a condition to guard the action:

lifecycle {
  action_trigger {
    events    = [after_create]
    actions   = [action.ansible_playbook.patch_instance]
    condition = var.enable_auto_patching
  }
}

When condition is false, the action is skipped completely. This is useful when the configuration should exist but only run in certain environments, like production.
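A minimal sketch of that pattern, assuming a var.environment variable (not part of the original configuration):

```
variable "environment" {
  type    = string
  default = "dev"
}

lifecycle {
  action_trigger {
    events    = [after_create]
    actions   = [action.ansible_playbook.patch_instance]
    condition = var.environment == "production"  # skipped in dev and staging
  }
}
```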

Running Actions from the CLI

This is where it gets useful for Day-2 workflows:

# Just plan the action, don't run it
terraform plan -invoke=action.aws_lambda_invoke.warm_cache

# Actually run the action
terraform apply -invoke=action.aws_lambda_invoke.warm_cache

Terraform executes only that one action; no other part of your configuration is evaluated or changed. Only one action can be invoked at a time, so you can't pass multiple -invoke flags in a single command.
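If an operational task needs several actions, run the invocations back to back, one per command (reusing action names defined elsewhere in this article):

```
terraform apply -invoke=action.aws_lambda_invoke.warm_cache
terraform apply -invoke=action.aws_cloudfront_create_invalidation.bust_cache
```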

Provisioning EC2 + Immediate Patching via Ansible Automation Platform

One of the most important and widely used use cases is linking EC2 provisioning and automated patching through Ansible Automation Platform (AAP).

The problem it solves is simple: an Ubuntu AMI provisioned months ago usually has a long list of pending security patches. If you provision EC2 instances and then patch each one manually, sooner or later (most likely within 30 days) an instance slips through unpatched. The fix is to tie patching to the instance's Terraform lifecycle so it cannot be missed.

The Terraform Side

variable "instance_count" {
  type    = number
  default = 2
}

variable "ubuntu_ami" {
  type        = string
  description = "AMI ID — use a recent Ubuntu LTS, patching will handle the rest"
}

variable "aap_controller_url" {
  type      = string
  sensitive = true
}

variable "aap_oauth_token" {
  type      = string
  sensitive = true
}

variable "allow_instance_reboot" {
  type    = bool
  default = false
}

resource "aws_instance" "app_servers" {
  count         = var.instance_count
  ami           = var.ubuntu_ami
  instance_type = "t3.medium"
  subnet_id     = aws_subnet.public.id
  key_name      = aws_key_pair.deployer.key_name

  vpc_security_group_ids = [aws_security_group.allow_ssh.id]

  tags = {
    Name      = "app-server-${count.index}"
    ManagedBy = "terraform"
  }

  lifecycle {
    action_trigger {
      events  = [after_create, after_update]
      actions = [action.ansible_aap_job.patch_servers]
    }
  }
}

The after_update event is critical: if an instance is replaced (AMI update, instance type change, anything that forces a new instance), the replacement gets patched automatically with no manual intervention.

The Action Block

action "ansible_aap_job" "patch_servers" {
  config {
    controller_url    = var.aap_controller_url
    oauth_token       = var.aap_oauth_token
    job_template_name = "EC2 Linux Patching"
    extra_vars = jsonencode({
      vm_hosts = [
        for instance in aws_instance.app_servers : {
          instance_id = instance.id
          public_ip   = instance.public_ip
        }
      ]
      allow_reboot = var.allow_instance_reboot
    })
  }
}

Credentials are stored in HCP Terraform's sensitive variable store. Instance IDs and IPs come straight from resource state at runtime, so AAP always gets current values.

Note: check your provider documentation for the exact argument names of the AAP action in the version you're using; the overall structure shown here remains valid.

The Ansible Playbook

AAP receives vm_hosts as an extra variable, builds inventory dynamically, and patches:

---
- name: Patch EC2 Instances
  hosts: all
  gather_facts: yes
  become: yes

  pre_tasks:
    - name: Wait for SSH connectivity
      ansible.builtin.wait_for_connection:
        timeout: 120
        delay: 10

    - name: Gather package facts
      ansible.builtin.package_facts:
        manager: apt

  tasks:
    - name: Update apt package index
      ansible.builtin.apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Apply security patches
      ansible.builtin.apt:
        upgrade: dist
        only_upgrade: yes
      register: patch_result

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required_file

    - name: Reboot if needed and allowed
      ansible.builtin.reboot:
        reboot_timeout: 300
        post_reboot_delay: 30
      when:
        - reboot_required_file.stat.exists
        - allow_reboot | default(false) | bool

  post_tasks:
    - name: Verify instance is up after patching
      ansible.builtin.ping:

/var/run/reboot-required is a file Ubuntu creates automatically when a package update (typically a kernel patch) requires a restart to take effect. The playbook checks for this file rather than blindly rebooting. And even then, it only reboots if allow_reboot is true, which is controlled from your Terraform variables.
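The check itself is trivial. Here is a self-contained sketch of the same logic, simulated against a temp file instead of the real /var/run/reboot-required so it never touches the system:

```shell
#!/bin/sh
# Simulate Ubuntu's reboot-required marker in a temp dir.
# (The real path is /var/run/reboot-required; this sketch is illustrative.)
flag="$(mktemp -d)/reboot-required"
echo "*** System restart required ***" > "$flag"  # a kernel update would write this

if [ -f "$flag" ]; then
  status="reboot pending"
else
  status="no reboot needed"
fi
echo "$status"
```

The Ansible stat + when combination in the playbook is exactly this if/else, expressed declaratively.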

AAP Job Template Configuration

On the Ansible Automation Platform side:

  • Project points at the Git repo containing your playbook.
  • Inventory is built dynamically from the vm_hosts value Terraform passes at runtime.
  • Credentials: the SSH private key lives in AAP's credential vault and is used to reach the instances. That's a clean separation between what exists (Terraform) and how to connect to it (Ansible).

What the Full Workflow Looks Like

An engineer changes instance_count from 2 to 5 and commits the change to Git.

The push lands in HCP Terraform, which picks up the change and runs a plan. The plan shows that Terraform will create three new EC2 instances and submit an action request once they exist.

An engineer reviews and approves the plan, and Terraform applies it, creating three EC2 instances in AWS. The action_trigger fires: Terraform calls AAP's API with the new instance IDs and public IPs to start the patching job.

AAP builds a dynamic inventory from the data Terraform sent and waits until all three instances are reachable over SSH. It then runs the apt dist-upgrade, checks whether a reboot is required, and reboots if allowed. As each instance comes back online and responds normally, AAP reports status back to Terraform.

Once AAP reports success, Terraform marks the run complete.

Other Places Actions Are Immediately Useful

CloudFront invalidation after S3 deployments

action "aws_cloudfront_create_invalidation" "bust_cache" {
  config {
    distribution_id = aws_cloudfront_distribution.website.id
    paths           = ["/*"]
  }
}

resource "aws_s3_object" "site_bundle" {
  lifecycle {
    action_trigger {
      events  = [after_update]
      actions = [action.aws_cloudfront_create_invalidation.bust_cache]
    }
  }
}

Lambda warm-up after deployments: the first production request after a deployment often hits a cold start, a common source of latency spikes and timeouts. Invoking the function immediately after deployment means real users don't pay that cost.

action "aws_lambda_invoke" "warm_up" {
  config {
    function_name = aws_lambda_function.api_handler.function_name
    payload       = jsonencode({ source = "warmup" })
  }
}

resource "aws_lambda_function" "api_handler" {
  lifecycle {
    action_trigger {
      events  = [after_create, after_update]
      actions = [action.aws_lambda_invoke.warm_up]
    }
  }
}

Stopping dev instances on demand

action "aws_ec2_stop_instance" "stop_dev" {
  config {
    instance_id = aws_instance.dev_server.id
  }
}
terraform apply -invoke=action.aws_ec2_stop_instance.stop_dev

Chaining multiple actions: actions takes a list, order is respected, and each action completes before the next starts:

lifecycle {
  action_trigger {
    events = [after_create]
    actions = [
      action.ansible_aap_job.patch_servers,
      action.aws_lambda_invoke.register_in_cmdb,
      action.aws_lambda_invoke.notify_slack
    ]
  }
}

Things That Will Catch You Out

A failed action blocks the run:

By default, Terraform waits for each action to finish, and the run's outcome depends on the action's status. That gives you visibility, but it also means that if AAP goes down right before a critical deployment, your run is stuck waiting on it. Use condition guards for actions whose failure shouldn't block the deployment.

Idempotency is not a luxury:

Every resource change fires the after_update event, which means your playbooks and Lambda handlers will be invoked many times over the life of your infrastructure. Running apt dist-upgrade twice is harmless; running a database migration twice is not. Design everything for re-execution from the start.
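One common way to make a one-shot task safe to re-trigger is a completion marker. A minimal sketch (the marker path and task are illustrative, not from the article):

```shell
#!/bin/sh
# Guard a one-time task with a marker file so repeated triggers are no-ops.
marker="$(mktemp -d)/migration-v42.done"   # illustrative marker path
runs=0

for attempt in 1 2 3; do                   # simulate the action firing three times
  if [ ! -f "$marker" ]; then
    runs=$((runs + 1))                     # stand-in for the real one-time task
    touch "$marker"                        # record completion; later triggers skip
  fi
done
echo "task ran $runs time(s)"
```

The same idea generalizes: for a Lambda handler, an idempotency key in a datastore plays the role of the marker file.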

Actions do not write to state:

When an action executes, nothing is written to the Terraform state file. The only evidence it ran is the run history in HCP Terraform and the logs of the other systems involved (AAP job history, CloudWatch, and so on). Plan your observability around that.

Provider support is still growing:

As of Terraform v1.14, the AWS provider ships only a narrow set of action types. Always check the Terraform Registry and provider changelogs before assuming an operation is available as an action.

CLI invocation requires existing resources:

If your action references an attribute of a resource that doesn't exist in state yet, the -invoke option fails during the plan phase. CLI-invoked actions should reference resources that have already been provisioned.

The Actual Shift

Almost all infrastructure management involves Day-2 operations.

In the past, Day-2 operations lived in runbooks, in a Jenkins job only a few people understood, or in a bash script last modified years ago. They were delivered reactively: something breaks, somebody acts.

With Terraform Actions, Day-2 operations can live alongside the infrastructure they manage: same repository, same pull request workflow, same audit trail. Patch management becomes part of the definition of the infrastructure it protects.

This kind of change reduces the number of incidents occurring at 2:00 AM.


Terraform Actions is stable from Terraform CLI v1.14.0. Check developer.hashicorp.com/terraform/language/invoke-actions for official documentation and your provider's registry page for supported action types.

Technical insights sourced from a community session on Terraform Day-2 operations.

For more developer content, visit vickybytes.com
