
Jari Haikonen for Polar Squad


Working with Terraform: Where LLMs actually help

Terraform state said one thing. The live environment said something else. The HCL config did not match either. An engineer had been doing the imports across a multi-region AWS Terragrunt project, and the work had not been done correctly. Some resources were managed, some were not, and nobody had a reliable picture of which was which. That is the worst starting point for an import: not "nothing done yet" but "something done, unclear what."

Parts of that work involved checking roughly 10,000 lines of network security group rules across four environments. That is the kind of job that could easily take weeks to do carefully by hand. With an LLM doing the mechanical work inside each step, I got through it in hours.


Two patterns worth knowing

The import work is one side of how LLMs help with Terraform. The other is module scaffolding, and it works through a different mechanism. The underlying principle is the same either way: give the model real context and it does useful work. Give it a vague prompt and you get generic output you have to reshape anyway.

On the import side, there are two distinct scenarios. The first is what I had: a remediation scenario where something was partially done, you do not know the current state clearly, and you need to figure out what is managed, what is not, and what is wrong before making any forward progress. The second is the general case of writing Terraform modules and importing existing resources into them as you go.

Both benefit from the same safety mechanism. Terraform's plan output tells you exactly what it will create, destroy, change, or import. You can paste that output directly to the LLM and say "these resources need to match 1:1 with what is running, here is what Terraform is planning." The tooling itself becomes the feedback loop.
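That feedback loop is easy to automate. Here is a sketch of pulling the pending actions and import IDs out of a saved plan, assuming Terraform 1.5+ and its machine-readable plan format (`terraform show -json`); the function names are mine, not part of any standard tooling:

```python
import json
import subprocess

def summarize_plan(plan: dict) -> dict:
    """Map each resource address to its planned actions and pending import ID."""
    summary = {}
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        importing = change.get("importing")  # present when an import block applies (TF >= 1.5)
        summary[rc["address"]] = {
            "actions": change["actions"],  # e.g. ["no-op"], ["update"], ["create"]
            "import_id": importing["id"] if importing else None,
        }
    return summary

def summarize_saved_plan(plan_path: str = "tfplan") -> dict:
    """Shell out to `terraform show -json` on a plan saved with `terraform plan -out=tfplan`."""
    raw = subprocess.run(
        ["terraform", "show", "-json", plan_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return summarize_plan(json.loads(raw))
```

A summary like this is also a tighter thing to paste to the LLM than a full human-readable plan when the plan runs long.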


The workflow that made it tractable

The first thing I needed was a complete picture of what actually existed in AWS. I built a small Python inventory tool that knew how to query each resource type through the AWS API and emit structured text. Thin shell wrappers called that tool once per type and wrote the results to a separate file per type, fifteen resource types in total. Getting the query syntax, field coverage, and output shape right for every type is exactly the kind of repetitive work where humans make typos and skip fields. I described what I wanted for one resource type and the LLM extended the Python in consistent, correctly-structured ways for the rest. The full inventory toolset was done in under an hour.
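The shape that makes this easy to extend is a registry of per-type query functions. A sketch in Python with boto3; the two resource types and the field choices here are illustrative, not the original tool's exact schema:

```python
import json

# Registry of per-type query functions, each taking a boto3 EC2 client.
# Extending the inventory to a new resource type means adding one entry here,
# which is exactly the repetitive step the LLM handled well.
INVENTORY = {
    "security_groups": lambda ec2: [
        {"id": sg["GroupId"], "name": sg["GroupName"], "vpc": sg.get("VpcId")}
        for sg in ec2.describe_security_groups()["SecurityGroups"]
    ],
    "internet_gateways": lambda ec2: [
        {"id": igw["InternetGatewayId"]}
        for igw in ec2.describe_internet_gateways()["InternetGateways"]
    ],
}

def render(resource_type, records):
    """Emit one stable, diff-friendly line per resource, sorted for clean diffs."""
    lines = [
        f"{resource_type}\t{r['id']}\t{json.dumps(r, sort_keys=True)}"
        for r in records
    ]
    return "\n".join(sorted(lines))

def dump_inventory(region, resource_type):
    """Query one resource type in one region and return its inventory text."""
    import boto3  # deferred so the rest of the module works without AWS access
    client = boto3.client("ec2", region_name=region)
    return render(resource_type, INVENTORY[resource_type](client))
```

Sorted, one-line-per-resource output matters more than it looks: it makes the files diffable between runs and keeps the LLM input stable.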

With a live inventory in hand, I pulled the Terraform state files to JSON. State as structured JSON is much easier to work with as LLM input than navigating raw remote state. With both sides in structured form, I wrote comparison scripts that identified resources existing in AWS but absent from state or HCL config. For security groups specifically, I built Python tooling that parsed both the live AWS rules and the HCL config, normalized the representations, and reported what was unmanaged in AWS and what was phantom in HCL. The LLM was fast at this: write a parser that understands two different formats, builds a shared logical model, and diffs them. The resulting scripts were clean enough to actually maintain after review. Not production-ready out of the box, but not throwaway code either.
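The comparison itself reduces to set arithmetic once both sides are structured. A sketch, assuming state JSON in the shape `terraform show -json` emits (`values.root_module.resources` plus nested `child_modules`):

```python
def managed_ids(state: dict, resource_type: str) -> set:
    """Collect IDs of one resource type from `terraform show -json` output,
    walking nested modules recursively."""
    ids = set()

    def walk(module):
        for res in module.get("resources", []):
            if res["type"] == resource_type:
                ids.add(res["values"].get("id"))
        for child in module.get("child_modules", []):
            walk(child)

    walk(state.get("values", {}).get("root_module", {}))
    return ids

def diff_against_live(state: dict, resource_type: str, live_ids: set) -> dict:
    """Compare Terraform-managed IDs against the live AWS inventory."""
    managed = managed_ids(state, resource_type)
    return {
        "unmanaged_in_aws": sorted(live_ids - managed),   # import candidates
        "phantom_in_state": sorted(managed - live_ids),   # stale state/config
    }
```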

Before importing anything, I prioritized explicitly. I produced a structured priority document using four criteria: how many hardcoded IDs referenced each resource, whether missing ownership would block other work, how complex the import would be, and how frequently the resource changed. The LLM read the raw evidence (grep results from HCL files, state JSON, inventory output) and synthesized it into a ranked first draft, which I reviewed and adjusted. Internet gateways first, then key pairs, CloudTrail, Secrets Manager. Lambda and SageMaker explicitly deferred.
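The scoring behind that ranking is simple enough to sketch. The four criteria map directly to fields; the weights here are illustrative, not the ones from the actual priority document:

```python
from dataclasses import dataclass

@dataclass
class ImportCandidate:
    name: str
    hardcoded_refs: int      # how many hardcoded IDs reference this resource
    blocks_other_work: bool  # does missing ownership block other work?
    complexity: int          # 1 (trivial import) .. 5 (hairy)
    change_frequency: int    # 1 (static) .. 5 (changes constantly)

def rank(candidates):
    """Higher score = import sooner. Weights are illustrative."""
    def score(c):
        return (
            3 * c.hardcoded_refs
            + (10 if c.blocks_other_work else 0)
            + 2 * c.change_frequency
            - 2 * c.complexity  # hard imports get deferred, all else equal
        )
    return sorted(candidates, key=score, reverse=True)
```

The LLM's job was not this arithmetic; it was reading the raw grep and JSON evidence and filling in the fields. The ranking formula is the part a human should own and adjust.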

The next decision was structural, and it was mine to make. The security groups were the hardest import problem, with multiple VPCs, dozens of SGs with many rules each, and everything sitting in two monolithic files. I made the call to split each security group into its own file before writing a single import block. Deciding structure before generating code meant the migration script only had to run once.

For the imports I used declarative import {} blocks in HCL. The reason this matters for LLM work: terraform plan shows you exactly which imports are pending, which succeeded, and which have ID mismatches. That output pastes directly into the LLM as a problem statement. The import ID format for SG rules is non-obvious (<sg-id>_<direction>_<protocol>_<from_port>_<to_port>_<source>) and generating it correctly across hundreds of rules is exactly the mechanical transformation the LLM handled well. When the plan shows drift after applying, you say "these resources need to match 1:1 with what is running, here is what Terraform is planning." The model reads the plan, identifies the mismatches, and suggests config changes. This works well for missing attributes, incorrect IDs, and straightforward drift. It is less reliable for complex dependency issues or very large plan output.
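Generating those IDs and the matching declarative import blocks is the kind of one-screen transformation the LLM wrote in a single pass. A sketch, with hypothetical function names:

```python
def rule_import_id(sg_id, direction, protocol, from_port, to_port, source):
    """Build the aws_security_group_rule import ID,
    e.g. sg-0123_ingress_tcp_443_443_10.0.0.0/8"""
    return f"{sg_id}_{direction}_{protocol}_{from_port}_{to_port}_{source}"

def import_block(address, import_id):
    """Render a declarative import {} block (Terraform >= 1.5)."""
    return f'import {{\n  to = {address}\n  id = "{import_id}"\n}}\n'
```

Run over a few hundred normalized rules, this emits HCL you append to the config and then let terraform plan judge.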

Applying the imports was not the last step. I described the verification logic I needed (iterate from live AWS outward, not from HCL inward, to catch resources in both directions) and the LLM wrote the script. Iterating from HCL only confirms that HCL-tracked resources are correct. Iterating from live AWS catches both: rules missing from HCL, and HCL files for resources that no longer exist in AWS. Results: 105 security groups fully clean, 2 rules missing from HCL, 5 stale HCL files for security groups already deleted from AWS.
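The core of that verification logic can be sketched with rules already normalized into comparable tuples; in the real script, the normalization of the two formats is most of the work:

```python
def verify(live: dict, hcl: dict) -> dict:
    """Iterate from the live side outward. `live` and `hcl` each map
    SG ID -> set of normalized rule tuples; this catches drift in both
    directions, which iterating from HCL alone would not."""
    report = {"clean": [], "rules_missing_from_hcl": {}, "stale_hcl_files": []}
    for sg_id, live_rules in live.items():
        missing = live_rules - hcl.get(sg_id, set())
        if missing:
            report["rules_missing_from_hcl"][sg_id] = sorted(missing)
        else:
            report["clean"].append(sg_id)
    # HCL files for security groups that no longer exist in AWS
    report["stale_hcl_files"] = sorted(set(hcl) - set(live))
    return report
```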


When documentation does the heavy lifting

The import workflow relies on feeding the LLM real files: state JSON, live inventory, actual HCL. The module writing pattern works the same way, but the input is different: your own documentation.

Once you have written down your module conventions (file layout, variable design rules, which level a resource belongs in, how things wire together) you have something the LLM can actually follow. The output fits your system rather than being generic Terraform you then have to reshape.

I fed it three documents. A repository README covering the level-based operating model and what each level owns. A modules guide covering the standard file layout, the settings object pattern, and the rules around variable schemas and defaults. An architecture document covering how modules are structured and when to use which pattern. Together they covered everything about how modules are built in that codebase. The README was anchored to the same principles as the Azure Terraform SRE landing zones levels documentation, grouping state by lifecycle and privilege with clear ownership between stacks, because we were already using that levels hierarchy on Azure and did not want AWS to get a one-off taxonomy. Parallel structure across clouds made the AWS side easier for the team to reason about and operate.

Feed those three documents, describe the AWS resource you need modeled, ask for a complete module. What comes back follows the correct file layout, uses the settings pattern, defines optional properties with optional() and sensible defaults, exports the expected outputs. None of that required additional prompting. It was in the documentation.

What it still needed from me: the architectural judgment. Does this resource warrant its own module? Which level does it live in? How does it wire into the rest of the codebase? The model follows documented patterns reliably, but it does not reason through those decisions on its own. Where something belongs in the system is still a call that requires understanding the system.

The time saving is specific. No writing the file structure from memory, no looking up whether a property is optional in the AWS provider, no deciding what the default for retention_days or enable_deletion_protection should be. The boilerplate that normally takes 20 minutes took two.


What did not work well

The LLM only worked well when I gave it the actual files. When I did not provide the real state JSON or the real HCL config, it guessed at structure and produced plausible but wrong output. Every step in this workflow involved feeding it real input.

Large monolithic HCL files caused problems. When the entire SG config was in one 600-line file, asking the LLM to modify it directly produced errors. Splitting into per-SG files was the right call for both maintainability and LLM usability.
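The split itself was scriptable. A sketch of the brace-counting approach in Python; this is deliberately naive, it assumes braces in the file are balanced and never hidden inside strings or heredocs, and real HCL deserves a real parser:

```python
import re

def split_resources(hcl_text: str) -> dict:
    """Split a monolithic HCL file into {resource_name: block_text} by
    matching `resource "aws_security_group" "<name>" {` headers and
    counting braces to find each block's end. Sketch only; not robust
    against braces inside strings or heredocs."""
    blocks = {}
    pattern = re.compile(r'resource\s+"aws_security_group"\s+"([\w-]+)"\s*\{')
    pos = 0
    while (m := pattern.search(hcl_text, pos)):
        depth, i = 1, m.end()
        while depth and i < len(hcl_text):
            depth += {"{": 1, "}": -1}.get(hcl_text[i], 0)
            i += 1
        blocks[m.group(1)] = hcl_text[m.start():i]
        pos = i
    return blocks
```

Each value then gets written out as its own file, and from that point both humans and the LLM work on one security group at a time.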

Config variation across environments tripped the LLM up. Three different Terragrunt config patterns existed across the four environments. The migration script handled two of the three correctly on the first attempt. The third required reading the actual file. The lesson is the same as the first point: the LLM cannot infer variation it has not seen.

The module writing pattern has the same failure mode from the other direction. When your documentation is incomplete or inconsistent, the model fills the gaps with its own defaults rather than yours. If the docs specify the pattern but do not cover what sensible defaults look like for a specific resource type, you get something plausible but not what you would have written. The documentation has to be good enough to actually be useful as context. The model reflects the quality of your documentation back at you.


The takeaway

Across both patterns, imports and module scaffolding, the LLM was not replacing judgment. It was eliminating the mechanical work: extending the shared inventory Python for each new resource type, parsing two incompatible config formats to find differences, generating hundreds of import IDs in the correct format, producing a correctly-structured module with the right variable schema and defaults.

In both cases, the precondition was the same. You supply the real context (actual files for imports, actual documentation for module writing) and you make the structural and architectural decisions yourself. The model handles the repetitive implementation. You verify the result.

That division of labor is clean when you know what you are doing. When you do not, the model still produces output. It just produces confident-looking output you are not equipped to evaluate.

If you want the broader framing behind that, there is an article on why the "very fast junior engineer" mental model is the one that makes the most sense for LLMs in DevOps work, and why domain expertise matters more with these tools, not less.
