Sudarshan Thakur
How I Built a Terraform Plan JSON Parser in Python

Most DevOps engineers have run terraform plan thousands of times. Very few have looked at what comes out when you add the -json flag.

I spent a couple of weeks inside that JSON output while building tfdrift, an open-source drift detection tool. What I found was a surprisingly well-structured format — and a handful of gotchas that aren't documented anywhere obvious.

This post is the technical deep-dive I wish I'd had when I started. If you're building any kind of tooling on top of Terraform — drift detection, policy enforcement, cost estimation, change analysis — this is the stuff you'll need to know.

Getting the JSON output

First, the basics. You can't just pipe terraform plan to get JSON. The text output and the JSON output are completely different codepaths. You need to do it in two steps:

# Step 1: Generate a plan file
terraform plan -out=tfplan -input=false -no-color

# Step 2: Convert to JSON
terraform show -json tfplan

The -out flag saves the plan as a binary file (not human-readable), and terraform show -json converts that binary into structured JSON. You can't skip step 1 — terraform plan does accept a -json flag, but it streams machine-readable log events to stdout, not the plan representation you want here. This tripped me up initially.

In Python, that looks like:

import subprocess
import json

def get_plan_json(workspace_path: str) -> dict:
    # Generate the plan file
    plan = subprocess.run(
        ["terraform", "plan", "-out=tfplan",
         "-input=false", "-no-color"],
        cwd=workspace_path,
        capture_output=True,
        text=True,
        timeout=600,
    )
    if plan.returncode != 0:
        # Don't silently run `show` against a plan file that was never written
        raise RuntimeError(f"terraform plan failed: {plan.stderr}")

    # Convert the binary plan file to JSON
    result = subprocess.run(
        ["terraform", "show", "-json", "tfplan"],
        cwd=workspace_path,
        capture_output=True,
        text=True,
        timeout=120,
    )
    result.check_returncode()

    return json.loads(result.stdout)

A few things to notice here. The timeout on plan is 10 minutes — for large workspaces with many resources, plan can take a while because it's refreshing state against the cloud provider APIs. The timeout on show is shorter because it's just reading a local file. And we always set -input=false because we're running non-interactively — without it, Terraform might hang waiting for user input.
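One consequence of setting timeouts: subprocess.run raises TimeoutExpired when the limit is hit, so a scanner that shouldn't die on one slow workspace needs to catch it. A minimal sketch (my own wrapper, not tfdrift's exact code):

```python
import subprocess
import sys

def run_with_timeout(cmd: list[str], cwd: str, timeout: int):
    """Run a command, returning None instead of raising on timeout."""
    try:
        return subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None  # treat a hung workspace as a scan failure, not a crash

# A trivially fast command finishes well within the limit
result = run_with_timeout([sys.executable, "-c", "pass"], cwd=".", timeout=60)
print(result is not None)  # True
```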

The structure of plan JSON

The JSON output has a lot of fields, but for drift detection, you really only care about one: resource_changes. Here's what a simplified version looks like:

{
  "format_version": "1.2",
  "terraform_version": "1.7.0",
  "resource_changes": [
    {
      "address": "aws_security_group.api_sg",
      "mode": "managed",
      "type": "aws_security_group",
      "name": "api_sg",
      "provider_name": "registry.terraform.io/hashicorp/aws",
      "change": {
        "actions": ["update"],
        "before": {
          "ingress": [
            {
              "from_port": 443,
              "to_port": 443,
              "cidr_blocks": ["10.0.0.0/8"]
            }
          ],
          "tags": {"Name": "api-sg"}
        },
        "after": {
          "ingress": [
            {
              "from_port": 443,
              "to_port": 443,
              "cidr_blocks": ["10.0.0.0/8"]
            },
            {
              "from_port": 22,
              "to_port": 22,
              "cidr_blocks": ["0.0.0.0/0"]
            }
          ],
          "tags": {"Name": "api-sg"}
        },
        "after_sensitive": {},
        "after_unknown": {}
      }
    }
  ]
}

Each entry in resource_changes represents one resource that Terraform wants to modify. The change object gives you a before snapshot (current state in the cloud) and an after snapshot (what Terraform wants it to look like). The actions array tells you what kind of change it is.
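To make that concrete, here's how you'd walk the structure above and pull out the per-resource actions (the plan dict is a hypothetical sample trimmed to the fields that matter):

```python
# Hypothetical plan dict in the shape shown above
plan = {
    "resource_changes": [
        {
            "address": "aws_security_group.api_sg",
            "mode": "managed",
            "change": {"actions": ["update"]},
        }
    ]
}

# One (address, actions) pair per planned change
summary = [
    (rc["address"], rc["change"]["actions"])
    for rc in plan.get("resource_changes", [])
]
print(summary)  # [('aws_security_group.api_sg', ['update'])]
```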

Parsing the actions field

The actions field is where it gets interesting. It's an array, not a string, and the possible values are:

# Single actions
["no-op"]     # Nothing changed
["create"]    # Resource exists in code but not in cloud
["delete"]    # Resource exists in cloud but not in code
["update"]    # Resource exists in both but attributes differ
["read"]      # Data source refresh

# Compound actions (two-element arrays)
["delete", "create"]  # Must destroy and recreate
["create", "delete"]  # Create new before destroying old

That last distinction matters. ["delete", "create"] means Terraform will delete the resource first, then create the replacement — there will be downtime. ["create", "delete"] means Terraform creates the new one first, then removes the old one — no downtime. The create_before_destroy lifecycle setting in your Terraform code controls which one you get.
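If your tooling cares about the downtime distinction, a tiny helper (my own sketch, not part of tfdrift) makes it explicit:

```python
def replacement_has_downtime(actions: list[str]) -> bool:
    """True for delete-before-create replacements, where the resource
    is gone for a window between the delete and the create."""
    return actions == ["delete", "create"]

print(replacement_has_downtime(["delete", "create"]))  # True
print(replacement_has_downtime(["create", "delete"]))  # False
```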

Here's how I mapped these to a simpler enum:

from enum import Enum

class ChangeAction(str, Enum):
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"
    REPLACE = "replace"
    NO_OP = "no-op"

ACTION_MAP = {
    "create": ChangeAction.CREATE,
    "update": ChangeAction.UPDATE,
    "delete": ChangeAction.DELETE,
    "create,delete": ChangeAction.REPLACE,
    "delete,create": ChangeAction.REPLACE,
    "no-op": ChangeAction.NO_OP,
    "read": ChangeAction.NO_OP,
}

def parse_action(actions: list[str]) -> ChangeAction:
    key = ",".join(actions)
    return ACTION_MAP.get(key, ChangeAction.UPDATE)

I collapsed both compound actions into a single REPLACE because for drift detection, the direction doesn't matter — either way, the resource is being recreated.

Extracting attribute-level diffs

This is the core of what makes severity classification possible. The before and after objects are full resource representations, so you can diff them attribute by attribute:

def extract_changes(before: dict, after: dict) -> list[dict]:
    """Find which specific attributes changed."""
    changes = []
    all_keys = set(before.keys()) | set(after.keys())

    for key in sorted(all_keys):
        old_val = before.get(key)
        new_val = after.get(key)

        if old_val != new_val:
            changes.append({
                "attribute": key,
                "old_value": old_val,
                "new_value": new_val,
            })

    return changes

Simple, right? Mostly. There are a few gotchas.

Gotcha 1: Sensitive values

Some attributes are marked as sensitive — passwords, API keys, things you don't want showing up in logs. Terraform handles this through the after_sensitive field:

{
  "after": {
    "password": "newsecret123",
    "username": "admin"
  },
  "after_sensitive": {
    "password": true,
    "username": false
  }
}

You need to check after_sensitive before logging or displaying values:

def extract_changes(before, after, after_sensitive=None):
    changes = []
    after_sensitive = after_sensitive or {}
    all_keys = set(before.keys()) | set(after.keys())

    for key in sorted(all_keys):
        old_val = before.get(key)
        new_val = after.get(key)

        if old_val != new_val:
            is_sensitive = (
                isinstance(after_sensitive, dict) 
                and after_sensitive.get(key, False)
            )
            changes.append({
                "attribute": key,
                "old_value": "(sensitive)" if is_sensitive else old_val,
                "new_value": "(sensitive)" if is_sensitive else new_val,
            })

    return changes

I learned this one the hard way when a database password showed up in a Slack notification during testing. Don't be me.

Gotcha 2: Nested objects and lists

Some attributes are deeply nested. A security group's ingress attribute isn't a simple value — it's a list of rule objects, each with from_port, to_port, protocol, and cidr_blocks. When you compare before["ingress"] to after["ingress"], Python's != operator works because it does deep comparison on lists and dicts. But the diff you get is "ingress changed" — you lose the granularity of which rule was added or removed.

For tfdrift, I made the decision to report at the attribute level, not the nested value level. So you get "ingress changed" rather than "a new rule was added allowing port 22 from 0.0.0.0/0." This is a tradeoff — you lose some detail, but you gain simplicity and the ability to do pattern-based severity matching on the attribute name.

If you need deeper diffing, you'd want something like deepdiff:

from deepdiff import DeepDiff

diff = DeepDiff(before["ingress"], after["ingress"])

But for severity classification, knowing that ingress changed is enough to classify it as Critical.
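For context, attribute-name matching really is enough to drive that classification. Here's a minimal sketch of the idea — the rule names and severity levels are my own illustration, not tfdrift's actual configuration:

```python
import fnmatch

# Hypothetical rules: glob pattern on the attribute name -> severity.
SEVERITY_RULES = [
    ("ingress", "critical"),
    ("egress", "critical"),
    ("*policy*", "high"),
    ("tags", "low"),
]

def classify(attribute: str) -> str:
    """Return the severity of the first rule matching this attribute."""
    for pattern, severity in SEVERITY_RULES:
        if fnmatch.fnmatch(attribute, pattern):
            return severity
    return "medium"  # default for attributes no rule matches

print(classify("ingress"))        # critical
print(classify("instance_type"))  # medium
```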

Gotcha 3: Null vs. absent

Terraform's JSON output isn't always consistent about null values. Sometimes a missing attribute is null, sometimes it's absent from the object entirely. Your comparison code needs to handle both:

old_val = before.get(key)  # Returns None if key doesn't exist
new_val = after.get(key)

# Both None/absent means no change
if old_val is None and new_val is None:
    continue
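Using .get makes the two representations compare equal, so a present-but-null attribute and a missing one never produce a spurious diff:

```python
before = {"description": None}  # attribute present, but null
after = {}                      # attribute absent entirely

old_val = before.get("description")  # None
new_val = after.get("description")   # None (.get defaults to None)

print(old_val == new_val)  # True — no spurious diff
```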

Gotcha 4: Data sources show up in resource_changes

This surprised me. Data sources (data.aws_ami.latest, data.aws_vpc.main, etc.) appear in the resource_changes array with a mode of "data". They're not real resources — they're just lookups. You need to filter them out:

for change in plan_json.get("resource_changes", []):
    if change.get("mode") == "data":
        continue  # Skip data sources

If you don't, you'll get false drift alerts every time a data source refreshes with slightly different results.

Gotcha 5: Module addresses

Resources inside modules have a module_address field:

{
  "address": "aws_instance.web",
  "module_address": "module.compute.module.web_tier",
  "type": "aws_instance",
  "name": "web"
}

The address field alone is aws_instance.web, which isn't unique if you have the same resource name in multiple modules. The full address is module.compute.module.web_tier.aws_instance.web. In tfdrift I handle this like:

@property
def full_address(self) -> str:
    if self.module:
        return f"{self.module}.{self.address}"
    return self.address

Handling variable files

Here's something I discovered only when testing against real infrastructure: a huge number of workspaces require variable files to plan successfully. Without the right .tfvars file, terraform plan fails immediately with "No value for required variable."

My first scan of a real environment found 150+ workspaces. About 80% of the subdirectory workspaces failed because they expected variables to be passed in. The main workspace had a terraform.tfvars file and worked fine.

The fix was auto-detection:

def find_var_files(workspace_path: Path) -> list[str]:
    """Find .tfvars files in a workspace directory."""
    var_files = []

    # terraform.tfvars is auto-loaded by Terraform anyway, but passing it
    # explicitly is harmless and keeps the plan command self-documenting
    default = workspace_path / "terraform.tfvars"
    if default.exists():
        var_files.append(str(default))

    # Also check for other .tfvars files
    for f in sorted(workspace_path.glob("*.tfvars")):
        if f.name == "terraform.tfvars":
            continue  # Already added
        if f.name.endswith(".auto.tfvars"):
            continue  # Terraform loads these automatically
        var_files.append(str(f))

    return var_files

Then when building the plan command:

plan_cmd = ["terraform", "plan", "-out=tfplan", 
            "-input=false", "-no-color"]

for vf in find_var_files(workspace_path):
    plan_cmd.extend(["-var-file", vf])

This single change took my scan success rate from about 20% to 85% on that same codebase.

The -detailed-exitcode flag

Here's a useful trick: terraform plan supports a -detailed-exitcode flag that changes the exit code behavior:

  • Exit code 0: Plan succeeded, no changes
  • Exit code 1: Error
  • Exit code 2: Plan succeeded, changes detected

This means you can quickly check for drift without parsing any JSON at all:

result = subprocess.run(
    ["terraform", "plan", "-detailed-exitcode",
     "-input=false", "-no-color", "-out=tfplan"],
    cwd=workspace_path,
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    # No drift — skip JSON parsing entirely
    return []
elif result.returncode == 1:
    # Error — log and move on
    return handle_error(result.stderr)
elif result.returncode == 2:
    # Drift detected — now parse the JSON
    return parse_plan_json(workspace_path)

This is a nice optimization because it means you only run terraform show -json (which can be slow for large state files) when there's actually something to look at.

Putting it all together

Here's the complete flow in simplified form:

def scan_workspace(workspace_path: str) -> list[DriftedResource]:
    """Scan a single workspace for drift."""

    # 1. Run terraform init
    subprocess.run(
        ["terraform", "init", "-backend=true", 
         "-input=false", "-no-color"],
        cwd=workspace_path,
        capture_output=True,
    )

    # 2. Build plan command with var files
    plan_cmd = [
        "terraform", "plan", "-detailed-exitcode",
        "-input=false", "-no-color", "-out=tfplan",
    ]
    for vf in find_var_files(Path(workspace_path)):
        plan_cmd.extend(["-var-file", vf])

    # 3. Run plan
    result = subprocess.run(
        plan_cmd, cwd=workspace_path,
        capture_output=True, text=True,
    )

    if result.returncode != 2:
        return []  # No drift or error

    # 4. Get JSON output
    show = subprocess.run(
        ["terraform", "show", "-json", "tfplan"],
        cwd=workspace_path,
        capture_output=True, text=True,
    )
    plan_json = json.loads(show.stdout)

    # 5. Parse changes
    drifted = []
    for change in plan_json.get("resource_changes", []):
        if change.get("mode") == "data":
            continue

        actions = change["change"]["actions"]
        action = parse_action(actions)

        if action == ChangeAction.NO_OP:
            continue

        before = change["change"].get("before") or {}
        after = change["change"].get("after") or {}
        sensitive = change["change"].get("after_sensitive") or {}

        attr_changes = extract_changes(before, after, sensitive)

        drifted.append(DriftedResource(
            address=change["address"],
            resource_type=change["type"],
            action=action,
            changes=attr_changes,
            module=change.get("module_address"),
        ))

    # 6. Clean up the plan file (real code should do this in a finally
    # block so it still runs when plan or parsing fails)
    Path(workspace_path, "tfplan").unlink(missing_ok=True)

    return drifted

That's roughly 60 lines to go from a Terraform workspace directory to a structured list of drifted resources with attribute-level detail. The actual tfdrift implementation adds severity classification, ignore rules, error handling, and reporting on top of this, but the core parsing logic is what you see here.

Lessons learned

Shell out, don't replicate. I briefly considered using the Terraform provider SDKs directly (like boto3 for AWS) to compare state. That would've been months of work and broken with every provider update. Shelling out to terraform plan -json gives you free compatibility with every provider, every backend, and every Terraform version since 0.12. The tradeoff is requiring a Terraform installation, but anyone who needs drift detection already has one.

The JSON format is stable but undocumented. HashiCorp has documentation for the JSON output format, but it's sparse. The format has been stable since Terraform 0.12 (2019), and I haven't encountered breaking changes across versions. But it's not formally versioned, so in theory a future release could change things. Supporting OpenTofu via a --binary flag is a hedge against that.
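A cheap hedge inside the parser itself is to check format_version up front and warn on majors you haven't tested against. A sketch (the "tested major" constant is my own convention):

```python
import warnings

TESTED_MAJOR = "1"  # plan JSON format majors this parser was written against

def check_format_version(plan_json: dict) -> None:
    """Warn if the plan JSON format major version is unfamiliar."""
    version = plan_json.get("format_version", "")
    major = version.split(".", 1)[0]
    if major != TESTED_MAJOR:
        warnings.warn(
            f"Untested plan JSON format_version {version!r}; "
            "parsing may be incomplete."
        )

check_format_version({"format_version": "1.2"})  # no warning
```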

Test against real infrastructure early. I built the parser against synthetic test data first and thought it was done. Then I pointed it at a real codebase with 150+ workspaces and discovered the variable file problem, the module address problem, and several edge cases with nested attributes. Synthetic tests are necessary but not sufficient.

Clean up after yourself. The tfplan binary file that terraform plan -out creates will accumulate if you don't delete it. In a scan across 150 workspaces, that's 150 binary files sitting in various directories. Always clean up in a finally block.
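In practice that means wrapping the plan/show/parse steps so cleanup always runs. A sketch with a stand-in body where the real scan logic goes:

```python
from pathlib import Path

def scan_with_cleanup(workspace_path: str) -> list:
    """Run a scan, guaranteeing the tfplan file is removed afterwards."""
    plan_file = Path(workspace_path) / "tfplan"
    try:
        # ... run terraform plan / show and parse the JSON here ...
        return []
    finally:
        # Runs even if plan generation or parsing raised
        plan_file.unlink(missing_ok=True)
```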

Try it yourself

All of this code is in the tfdrift source repository.

Or just install it:

pip install tfdrift
tfdrift scan --path ./your-terraform-dir --verbose

If you're building your own Terraform tooling and run into edge cases I didn't cover, I'd love to hear about them. Open an issue or drop a comment.


This is part 3 of a series on infrastructure drift detection. Part 1: I Built a Free Terraform Drift Detector — Here's Why. Part 2: Why Severity Classification Changes Everything About Drift Detection. Coming next: Setting Up Continuous Drift Monitoring With GitHub Actions and Slack.
