<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jari Haikonen</title>
    <description>The latest articles on DEV Community by Jari Haikonen (@jarihaikonen).</description>
    <link>https://dev.to/jarihaikonen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3827147%2Fdd1c5bcf-8054-4048-92ed-3a1299b4e989.png</url>
      <title>DEV Community: Jari Haikonen</title>
      <link>https://dev.to/jarihaikonen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jarihaikonen"/>
    <language>en</language>
    <item>
      <title>Working with Terraform: Where LLMs actually help</title>
      <dc:creator>Jari Haikonen</dc:creator>
      <pubDate>Wed, 15 Apr 2026 04:16:33 +0000</pubDate>
      <link>https://dev.to/polarsquad/working-with-terraform-where-llms-actually-help-2a4p</link>
      <guid>https://dev.to/polarsquad/working-with-terraform-where-llms-actually-help-2a4p</guid>
      <description>&lt;p&gt;Terraform state said one thing. The live environment said something else. The HCL config did not match either. An engineer had been doing the imports across a multi-region AWS Terragrunt project and the work had not been done correctly. Some resources were managed, some were not, and nobody had a reliable picture of which was which. That is the worst starting point for an import: not "nothing done yet" but "something done, unclear what."&lt;/p&gt;

&lt;p&gt;Parts of that work involved checking roughly 10,000 lines of network security group rules across all four environments. That is the kind of job that could easily take weeks to do carefully by hand. With an LLM doing the mechanical work inside each step, I got through it in hours.&lt;/p&gt;




&lt;h2&gt;Two patterns worth knowing&lt;/h2&gt;

&lt;p&gt;The import work is one side of how LLMs help with Terraform. The other is module scaffolding, and it works through a different mechanism. The underlying principle is the same either way: give the model real context and it does useful work. Give it a vague prompt and you get generic output you have to reshape anyway.&lt;/p&gt;

&lt;p&gt;On the import side, there are two distinct scenarios. The first is what I had: a remediation scenario where something was partially done, you do not know the current state clearly, and you need to figure out what is managed, what is not, and what is wrong before making any forward progress. The second is the general case of writing Terraform modules and importing existing resources into them as you go.&lt;/p&gt;

&lt;p&gt;Both benefit from the same safety mechanism. Terraform's plan output tells you exactly what it will create, destroy, change, or import. You can paste that output directly to the LLM and say "these resources need to match 1:1 with what is running, here is what Terraform is planning." The tooling itself becomes the feedback loop.&lt;/p&gt;




&lt;h2&gt;The workflow that made it tractable&lt;/h2&gt;

&lt;p&gt;The first thing I needed was a complete picture of what actually existed in AWS. I built a small Python tool that knew how to query each resource type through the AWS API and emit structured text. Thin shell wrappers invoked it once per resource type, fifteen types in total, and wrote the results to a separate file per type. Getting the query syntax, field coverage, and output shape right for every type is exactly the kind of repetitive work where humans make typos and skip fields. I described what I wanted for one resource type and the LLM extended the Python in consistent, correctly structured ways for the rest. The full inventory toolset was done in under an hour.&lt;/p&gt;
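
&lt;p&gt;The exact code matters less than the shape: one dispatch table, one query function per resource type, one output file per type. A minimal sketch of that structure, with stubbed queries and made-up resource data standing in for the real AWS calls (the real tool talked to the AWS API, e.g. via boto3):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a per-type inventory dispatcher. Resource types, fields, and
# the stubbed query functions are illustrative stand-ins; the real tool
# queried AWS and covered fifteen resource types.
import json

def list_security_groups(region):
    # Stand-in for a real describe_security_groups call per region.
    return [{"id": "sg-0123", "region": region}]

def list_internet_gateways(region):
    # Stand-in for a real describe_internet_gateways call per region.
    return [{"id": "igw-0456", "region": region}]

# Adding a resource type means one new entry and one new query function.
QUERIES = {
    "security_groups": list_security_groups,
    "internet_gateways": list_internet_gateways,
}

def write_inventory(region, out_dir):
    # One structured output file per resource type, per region.
    for name, query in QUERIES.items():
        path = f"{out_dir}/{name}.{region}.json"
        with open(path, "w") as f:
            json.dump(query(region), f, indent=2)

inventory = {name: query("eu-west-1") for name, query in QUERIES.items()}
print(sorted(inventory))   # ['internet_gateways', 'security_groups']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Extending it for a new resource type is one dict entry and one function, which is exactly the repetitive step the LLM handled.&lt;/p&gt;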

&lt;p&gt;With a live inventory in hand, I pulled the Terraform state files to JSON. State as structured JSON is much easier to work with as LLM input than navigating raw remote state. With both sides in structured form, I wrote comparison scripts that identified resources existing in AWS but absent from state or HCL config. For security groups specifically, I built Python tooling that parsed both the live AWS rules and the HCL config, normalized the representations, and reported what was unmanaged in AWS and what was phantom in HCL. The LLM was fast at this: write a parser that understands two different formats, builds a shared logical model, and diffs them. The resulting scripts were clean enough to actually maintain after review. Not production-ready out of the box, but not throwaway code either.&lt;/p&gt;
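
&lt;p&gt;Once both sides are normalized to a shared key, the core of such a comparison reduces to two set differences. A minimal sketch with made-up IDs (the real scripts normalized and compared rule contents, not just resource IDs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal model of the live-vs-state comparison: normalize both sides to
# a shared key (here just the resource ID, a simplification), then diff
# in both directions. IDs are made up.
live_aws = {"sg-aaa", "sg-bbb", "sg-ccc"}   # from the live inventory tool
in_state = {"sg-aaa", "sg-ddd"}             # from the state pulled to JSON

unmanaged = live_aws - in_state   # exists in AWS, not under Terraform
phantom = in_state - live_aws     # tracked in state/HCL, gone from AWS

print(sorted(unmanaged))   # ['sg-bbb', 'sg-ccc']
print(sorted(phantom))     # ['sg-ddd']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;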

&lt;p&gt;Before importing anything, I prioritized explicitly. I produced a structured priority document using four criteria: how many hardcoded IDs referenced each resource, whether missing ownership would block other work, how complex the import would be, and how frequently the resource changed. The LLM read the raw evidence (grep results from HCL files, state JSON, inventory output) and synthesized it into a ranked first draft, which I reviewed and adjusted. Internet gateways first, then key pairs, CloudTrail, Secrets Manager. Lambda and SageMaker explicitly deferred.&lt;/p&gt;

&lt;p&gt;The next decision was structural, and it was mine to make. The security groups were the hardest import problem, with multiple VPCs, dozens of SGs with many rules each, and everything sitting in two monolithic files. I made the call to split each security group into its own file before writing a single import block. Deciding structure before generating code meant the migration script only had to run once.&lt;/p&gt;

&lt;p&gt;For the imports I used declarative &lt;code&gt;import {}&lt;/code&gt; blocks in HCL. The reason this matters for LLM work: &lt;code&gt;terraform plan&lt;/code&gt; shows you exactly which imports are pending, which succeeded, and which have ID mismatches. That output pastes directly into the LLM as a problem statement. The import ID format for SG rules is non-obvious (&lt;code&gt;&amp;lt;sg-id&amp;gt;_&amp;lt;direction&amp;gt;_&amp;lt;protocol&amp;gt;_&amp;lt;from_port&amp;gt;_&amp;lt;to_port&amp;gt;_&amp;lt;source&amp;gt;&lt;/code&gt;) and generating it correctly across hundreds of rules is exactly the mechanical transformation the LLM handled well. When the plan shows drift after applying, you say "these resources need to match 1:1 with what is running, here is what Terraform is planning." The model reads the plan, identifies the mismatches, and suggests config changes. This works well for missing attributes, incorrect IDs, and straightforward drift. It is less reliable for complex dependency issues or very large plan output.&lt;/p&gt;
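
&lt;p&gt;That ID generation is a pure string transformation over normalized rule data, which is exactly why it delegates well. A sketch of the shape (the field names are mine, not from the real tooling):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Build the SG rule import ID described above from a normalized rule.
# The rule dict and its field names are illustrative.
def sg_rule_import_id(rule):
    keys = ("sg_id", "direction", "protocol",
            "from_port", "to_port", "source")
    return "_".join(str(rule[k]) for k in keys)

rule = {"sg_id": "sg-0123", "direction": "ingress", "protocol": "tcp",
        "from_port": 443, "to_port": 443, "source": "10.0.0.0/16"}

print(sg_rule_import_id(rule))   # sg-0123_ingress_tcp_443_443_10.0.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;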

&lt;p&gt;Applying the imports was not the last step. I described the verification logic I needed (start from live AWS rather than from HCL, and compare in both directions) and the LLM wrote the script. Iterating from HCL alone only confirms that HCL-tracked resources are correct. Comparing in both directions catches both problems: rules missing from HCL, and HCL files for resources that no longer exist in AWS. Results: 105 security groups fully clean, 2 rules missing from HCL, 5 stale HCL files for security groups already deleted from AWS.&lt;/p&gt;




&lt;h2&gt;When documentation does the heavy lifting&lt;/h2&gt;

&lt;p&gt;The import workflow relies on feeding the LLM real files: state JSON, live inventory, actual HCL. The module writing pattern works the same way, but the input is different: your own documentation.&lt;/p&gt;

&lt;p&gt;Once you have written down your module conventions (file layout, variable design rules, which level a resource belongs in, how things wire together) you have something the LLM can actually follow. The output fits your system rather than being generic Terraform you then have to reshape.&lt;/p&gt;

&lt;p&gt;I fed it three documents. The first was a repository README covering the level-based operating model and what each level owns. We were already using that kind of levels hierarchy on Azure Terraform, and we anchored the README to the same principles described in the &lt;a href="https://aztfmod.github.io/documentation/docs/fundamentals/lz-intro/" rel="noopener noreferrer"&gt;Azure Terraform SRE landing zones levels documentation&lt;/a&gt;: grouping state by lifecycle and privilege, with clear ownership between stacks. That kept AWS from getting a one-off taxonomy, and the parallel structure across clouds made the AWS side easier for the team to reason about and operate. The second was a modules guide covering the standard file layout, the &lt;code&gt;settings&lt;/code&gt; object pattern, and the rules around variable schemas and defaults. The third was an architecture document covering how modules are structured and when to use which pattern. Together they covered everything about how modules are built in that codebase.&lt;/p&gt;

&lt;p&gt;Feed those three documents, describe the AWS resource you need modeled, ask for a complete module. What comes back follows the correct file layout, uses the &lt;code&gt;settings&lt;/code&gt; pattern, defines optional properties with &lt;code&gt;optional()&lt;/code&gt; and sensible defaults, exports the expected outputs. None of that required additional prompting. It was in the documentation.&lt;/p&gt;

&lt;p&gt;What it still needed from me: the architectural judgment. Does this resource warrant its own module? Which level does it live in? How does it wire into the rest of the codebase? The model follows documented patterns reliably, but it does not reason through those decisions on its own. Where something belongs in the system is still a call that requires understanding the system.&lt;/p&gt;

&lt;p&gt;The time saving is specific. No writing the file structure from memory, no looking up whether a property is optional in the AWS provider, no deciding what the default for &lt;code&gt;retention_days&lt;/code&gt; or &lt;code&gt;enable_deletion_protection&lt;/code&gt; should be. The boilerplate that normally takes 20 minutes took two.&lt;/p&gt;




&lt;h2&gt;What did not work well&lt;/h2&gt;

&lt;p&gt;The LLM only worked well when I gave it the actual files. When I did not provide the real state JSON or the real HCL config, it guessed at structure and produced plausible but wrong output. Every step in this workflow involved feeding it real input.&lt;/p&gt;

&lt;p&gt;Large monolithic HCL files caused problems. When the entire SG config was in one 600-line file, asking the LLM to modify it directly produced errors. Splitting into per-SG files was the right call for both maintainability and LLM usability.&lt;/p&gt;

&lt;p&gt;Config variation across environments tripped the LLM up. Three different Terragrunt config patterns existed across the four environments. The migration script handled two of the three correctly on the first attempt. The third required reading the actual file. The lesson is the same as the first point: the LLM cannot infer variation it has not seen.&lt;/p&gt;

&lt;p&gt;The module writing pattern has the same failure mode from the other direction. When your documentation is incomplete or inconsistent, the model fills the gaps with its own defaults rather than yours. If the docs specify the pattern but do not cover what sensible defaults look like for a specific resource type, you get something plausible but not what you would have written. The documentation has to be good enough to actually be useful as context. The model reflects the quality of your documentation back at you.&lt;/p&gt;




&lt;h2&gt;The takeaway&lt;/h2&gt;

&lt;p&gt;Across both patterns, imports and module scaffolding, the LLM was not replacing judgment. It was eliminating the mechanical work: extending the shared inventory Python for each new resource type, parsing two incompatible config formats to find differences, generating hundreds of import IDs in the correct format, producing a correctly structured module with the right variable schema and defaults.&lt;/p&gt;

&lt;p&gt;In both cases, the precondition was the same. You supply the real context (actual files for imports, actual documentation for module writing) and you make the structural and architectural decisions yourself. The model handles the repetitive implementation. You verify the result.&lt;/p&gt;

&lt;p&gt;That division of labor is clean when you know what you are doing. When you do not, the model still produces output. It just produces confident-looking output you are not equipped to evaluate.&lt;/p&gt;

&lt;p&gt;If you want the broader framing behind that, there is an article on why the "very fast junior engineer" mental model is the one that makes the most sense for LLMs in DevOps work — and why domain expertise matters more with these tools, not less.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>terraform</category>
    </item>
    <item>
      <title>AI vs reality: Why GitLab pipelines confuse LLMs</title>
      <dc:creator>Jari Haikonen</dc:creator>
      <pubDate>Wed, 08 Apr 2026 05:09:11 +0000</pubDate>
      <link>https://dev.to/polarsquad/ai-vs-reality-why-gitlab-pipelines-confuse-llms-h90</link>
      <guid>https://dev.to/polarsquad/ai-vs-reality-why-gitlab-pipelines-confuse-llms-h90</guid>
      <description>&lt;p&gt;The model gave me perfectly valid YAML. The pipeline failed. I asked the model to fix it. It gave me more perfectly valid YAML. The pipeline failed again. After the fourth iteration I just opened the GitLab docs, found the issue in two minutes, and fixed it myself.&lt;/p&gt;

&lt;p&gt;This is one of the most common frustrations I have seen with LLMs in DevOps work. The &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; file I am working with has 242 commits. 73 of them contain the word "fix". The theme across most of them: the YAML is valid. GitLab disagrees.&lt;/p&gt;




&lt;h2&gt;GitLab pipelines are not just YAML&lt;/h2&gt;

&lt;p&gt;GitLab CI/CD pipelines are defined in a &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; file, and yes, the format is YAML. But GitLab has its own specific implementation on top of that, with its own keywords, its own scoping rules, and its own runtime semantics. Generic YAML parsers will happily accept a file that GitLab's pipeline linter will reject. And sometimes GitLab does not reject the file at all. It just runs differently than you expected.&lt;/p&gt;

&lt;p&gt;That gap between "valid YAML" and "valid GitLab pipeline" is where the problems live.&lt;/p&gt;




&lt;h2&gt;Three ways this plays out in practice&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;changes:&lt;/code&gt; anchor problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pipeline had repeated file path lists in every rule block. The natural LLM suggestion: extract them into YAML anchors and reference them. It produced something that looked completely reasonable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;.frontend-changes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nl"&gt;&amp;amp;frontend-changes&lt;/span&gt;
  &lt;span class="na"&gt;changes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apps/frontend/**/*"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;packages/**/*"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.gitlab-ci.yml"&lt;/span&gt;

&lt;span class="na"&gt;.mr-frontend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_PIPELINE_SOURCE == "merge_request_event"&lt;/span&gt;
      &lt;span class="na"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;*frontend-changes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Valid YAML. Not valid GitLab CI in practice. YAML anchors are resolved before GitLab ever processes the config. GitLab only sees the expanded result. The problem is that merge keys (&lt;code&gt;&amp;lt;&amp;lt;:&lt;/code&gt;) do not behave predictably inside nested rule structures. After the merge is applied, the resulting shape may not match what GitLab expects for a &lt;code&gt;rules:changes&lt;/code&gt; block, so it either silently falls back to "always run" or evaluates incorrectly depending on context.&lt;/p&gt;

&lt;p&gt;The anchors exist in the actual file. The &lt;code&gt;&amp;lt;&amp;lt;:&lt;/code&gt; syntax is there. It just does not do what it looks like it does. The comment sitting in the production file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sadly gitlab changes do not support with anchors or references to make these DRY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
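
&lt;p&gt;One way to make the gap visible is to expand the anchors yourself. Any YAML 1.1 loader resolves anchors and merge keys before the consumer sees anything, so what reaches GitLab is only the merged mapping, never the anchor. A sketch using PyYAML (assumed available; this models the expansion step only, not GitLab's rule evaluation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Show what GitLab actually receives: the YAML loader resolves anchors
# and merge keys before any GitLab-specific processing runs.
import yaml  # PyYAML, assumed installed

doc = """
.frontend-changes: &amp;amp;frontend-changes
  changes:
    - "apps/frontend/**/*"
    - ".gitlab-ci.yml"

.mr-frontend:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      &amp;lt;&amp;lt;: *frontend-changes
"""

expanded = yaml.safe_load(doc)
# The anchor name is gone; only the merged rule item remains, and GitLab
# has to make sense of whatever shape the merge produced.
print(expanded[".mr-frontend"]["rules"][0]["changes"])
# ['apps/frontend/**/*', '.gitlab-ci.yml']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;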



&lt;p&gt;&lt;strong&gt;The &lt;code&gt;extends&lt;/code&gt; + silent replacement problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next suggestion: use &lt;code&gt;extends:&lt;/code&gt; to compose rule templates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;lint-all&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;extends&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;.mr-frontend&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;.mr-backend&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model expected this to combine both templates' rules, giving you OR logic: the job runs if frontend files changed or if backend files changed. That is not what happens. &lt;code&gt;extends:&lt;/code&gt; does not merge arrays — it &lt;a href="https://docs.gitlab.com/ci/yaml/yaml_optimization/#merge-details" rel="noopener noreferrer"&gt;replaces them&lt;/a&gt;, with the last template winning. In the expanded configuration, &lt;code&gt;lint-all&lt;/code&gt; ends up with only &lt;code&gt;.mr-backend&lt;/code&gt;'s rules. The frontend &lt;code&gt;changes:&lt;/code&gt; patterns disappear without any warning.&lt;/p&gt;
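
&lt;p&gt;The documented merge rule can be modeled in a few lines: mappings merge key by key, while arrays and scalars are replaced outright by the later source. A simplified sketch (my own model of the documented behavior, not GitLab's implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified model of the extends merge: dicts merge key by key, while
# lists and scalars are replaced by the later template. Template contents
# are trimmed down to just the rules that matter here.
def merge(base, override):
    if isinstance(base, dict) and isinstance(override, dict):
        out = dict(base)
        for key, value in override.items():
            out[key] = merge(out[key], value) if key in out else value
        return out
    return override   # lists and scalars: last one wins

mr_frontend = {"rules": [{"changes": ["apps/frontend/**/*"]}]}
mr_backend = {"rules": [{"changes": ["apps/backend/**/*"]}]}

# extends: [.mr-frontend, .mr-backend] applied to an otherwise empty job
lint_all = merge(merge({}, mr_frontend), mr_backend)
print(lint_all["rules"])   # [{'changes': ['apps/backend/**/*']}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The frontend rules are simply absent from the merged job, with no warning.&lt;/p&gt;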

&lt;p&gt;The reason this went undetected: both templates shared most of the same &lt;code&gt;changes:&lt;/code&gt; paths — &lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;packages/**/*&lt;/code&gt;, &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;, and others. The only difference was the last entry: &lt;code&gt;apps/frontend/**/*&lt;/code&gt; versus &lt;code&gt;apps/backend/**/*&lt;/code&gt;. Most real commits touch shared files, so the job triggered anyway. But a commit that only changes frontend code and nothing else would silently skip the lint job. That bug was live in the pipeline for months.&lt;/p&gt;

&lt;p&gt;For explicit control over rule composition, GitLab's &lt;code&gt;!reference&lt;/code&gt; tags let you manually assemble the &lt;code&gt;rules:&lt;/code&gt; array from multiple sources. But even then, you only get OR logic: GitLab evaluates rules in order and the first matching rule decides the outcome. There is no way to compose AND conditions — run only when a specific branch condition AND specific file changes are both true — across reusable templates. Every combination has to be written out explicitly. The pipeline has seven of these rule templates that a model will always try to collapse into two or three. It cannot be done without changing the semantics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# These either cant be DRY because the rules are OR and not AND
# (the - &amp;amp; anchors do not work really well)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The 12-minute revert&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One commit consolidated a &lt;code&gt;parallel: matrix&lt;/code&gt; Docker build into a single sequential job. The reasoning, left in a comment, was exactly the kind of thing a model writes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Single job: on main we run stage then latest (same Docker layer cache,
# second build is fast). Separate jobs would duplicate work; one job with
# sequential builds reuses cache.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logically correct. The Docker layer cache argument is real. The revert came 12 minutes later with no commit message. The problem was not correctness. &lt;code&gt;parallel: matrix&lt;/code&gt; gives you separate job entries in the pipeline UI, separate log streams, the ability to retry one variant independently, and separate pass/fail status per build type. Collapsing into one job trades all of that for a real but secondary cache win.&lt;/p&gt;

&lt;p&gt;The model optimized for build efficiency, but the system required failure isolation and debuggability. It did not know how you use the GitLab pipeline UI when something breaks at 2am.&lt;/p&gt;




&lt;h2&gt;The shape of the problem&lt;/h2&gt;

&lt;p&gt;Every one of these failures produces valid YAML. CI Lint may pass too. It can simulate pipeline creation for the default branch, but it cannot replicate the full runtime context — which branch triggered the pipeline, which files changed in the merge request, which variables are set. The subtle rule evaluation issues only surface when the pipeline actually runs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What YAML says&lt;/th&gt;
&lt;th&gt;What GitLab does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;&amp;lt;: *anchor&lt;/code&gt; inside &lt;code&gt;rules:&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Structurally valid, semantically inconsistent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;extends: [A, B]&lt;/code&gt; with &lt;code&gt;rules:&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Last template silently replaces the first; overlapping patterns hide the bug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;parallel: matrix&lt;/code&gt; removed&lt;/td&gt;
&lt;td&gt;Valid, but changes operational behavior, not just output&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The LLM generates syntactically correct config. What it cannot do is predict how that config will behave in a system where the outcome depends on evaluation order, repo state, and pipeline context that are not visible in the file itself. You can give it better documentation and it will still get this wrong, because the knowledge it is missing only appears when the pipeline actually runs.&lt;/p&gt;




&lt;h2&gt;How to recognize you are in a loop&lt;/h2&gt;

&lt;p&gt;After a few rounds of this you start to recognize the pattern. The error is not actually changing between iterations, the model is adding complexity rather than addressing the root cause, and you are spending more time writing context than it would take to just look up the answer yourself.&lt;/p&gt;

&lt;p&gt;Two or three iterations without meaningful progress is the signal. The right move is to stop, go to the GitLab documentation directly, and use GitLab's built-in CI Lint tool to validate the syntax. Find the actual constraint, fix it yourself, and re-engage the model for the work around it. The model is still useful for the majority of pipeline work. It is just not useful for the parts that require knowing GitLab's specific implementation.&lt;/p&gt;




&lt;h2&gt;The takeaway&lt;/h2&gt;

&lt;p&gt;Your tooling-specific experience is not optional when things go wrong. LLMs are useful for writing pipeline structure, generating job definitions, and handling the repetitive parts. But when something breaks in a GitLab-specific way, the fastest path forward is usually you, not the model.&lt;/p&gt;

&lt;p&gt;And when the pipeline gives you the same error for the fourth time, open the docs.&lt;/p&gt;

&lt;p&gt;If you are curious about the flip side of this, there is an article on two Terraform workflows where LLMs genuinely help: importing existing infrastructure and scaffolding modules from your own documentation. That one is a much more satisfying story.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;A note on timing: most of the work described here was done close to a year ago, without MCP or similar tool integrations that give models direct access to documentation and live context. Models may handle some of these cases better today. The underlying gap between "valid YAML" and "valid GitLab pipeline" is still real, but your mileage may vary.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gitlab</category>
      <category>cicd</category>
    </item>
    <item>
      <title>When LLMs struggle: Architecture, context, and hidden complexity</title>
      <dc:creator>Jari Haikonen</dc:creator>
      <pubDate>Tue, 31 Mar 2026 06:44:35 +0000</pubDate>
      <link>https://dev.to/polarsquad/when-llms-struggle-architecture-context-and-hidden-complexity-54be</link>
      <guid>https://dev.to/polarsquad/when-llms-struggle-architecture-context-and-hidden-complexity-54be</guid>
      <description>&lt;p&gt;The obvious LLM failures are easy to catch. Syntax errors, broken configs, a pipeline that refuses to run. You see the problem immediately and fix it. Those are not the ones that should worry you.&lt;/p&gt;

&lt;p&gt;The ones that should worry you are the ones that look completely fine. The code runs. The config is valid. The output looks reasonable. And yet, when someone with more experience takes a look, the problems become obvious immediately. They were just invisible to you.&lt;/p&gt;




&lt;h2&gt;The knowledge mirror&lt;/h2&gt;

&lt;p&gt;There is a pattern I noticed pretty quickly when working on tasks outside my main area.&lt;/p&gt;

&lt;p&gt;When I work in areas I know well, like Terraform or CI/CD pipelines, I can evaluate the model's output almost automatically. I know what good looks like, I know the common failure patterns, and I catch mistakes fast. The feedback loop is tight.&lt;/p&gt;

&lt;p&gt;But when I work on something I know less about, that feedback loop breaks. And the problem is that the model does not help you here at all. It does not become more cautious in unfamiliar territory. It does not tell you when it is guessing. It produces the same confident, well-formatted output regardless of whether it is right or wrong.&lt;/p&gt;

&lt;p&gt;Anthropic's engineering team documented the same behavior when building multi-agent coding harnesses. Their &lt;a href="https://www.anthropic.com/engineering/harness-design-long-running-apps" rel="noopener noreferrer"&gt;published observation&lt;/a&gt;: when asked to evaluate their own output, agents reliably respond by confidently praising it, even when the quality is obviously mediocre to a human observer. The confidence is not correlated with the actual quality of the work.&lt;/p&gt;

&lt;p&gt;So what you end up with is a mirror: the model reflects your level of knowledge back at you. If you know the domain, you catch the mistakes. If you do not, you miss them. And the less you know, the more you are at the mercy of output you cannot properly evaluate.&lt;/p&gt;

&lt;p&gt;In practice, for me this showed up most clearly in development tasks. There are usually several valid ways to implement the same thing in code, and the right choice depends on context, team conventions, performance requirements, and a lot of other things the model cannot know. If you are not familiar enough with those trade-offs yourself, the model will just pick one. And if you do not notice, it gets built on.&lt;/p&gt;




&lt;h2&gt;The problem with architectural decisions&lt;/h2&gt;

&lt;p&gt;LLMs are actually quite good at implementing things. Give the model a clear approach and it will execute it well. The problem is when you ask it to choose the approach.&lt;/p&gt;

&lt;p&gt;Architectural decisions involve context the model simply does not have: your team's skill level, how much complexity the team can realistically own and maintain at once, which parts of your system are already overengineered, the operational cost of what it is about to build, your future plans. Without that, it defaults to what it has seen most in training data, which is usually the textbook approach or the most complex one, not necessarily the most appropriate one for your situation.&lt;/p&gt;

&lt;p&gt;In DevOps this matters more than it might seem, because infrastructure decisions have long tails. A bad pattern in a Terraform module layout, a poorly thought-out pipeline structure, a dependency that should not be there, these things propagate. Fixing them later is far more expensive than catching them early.&lt;/p&gt;

&lt;p&gt;There is also the consistency problem. The model has absorbed a lot of old documentation and outdated best practices, and it has no sense of what is current or what fits your context. So it might solve the same kind of problem in two different places in your codebase using two completely different approaches. Both technically valid, but inconsistent in ways that make things harder to maintain over time.&lt;/p&gt;

&lt;p&gt;The practical answer is documentation: write down your decisions and conventions and feed them to the model. This genuinely helps. But the model does not always follow the rules you set for it, especially on longer tasks. You still need to review what it produces. Documentation reduces the drift, it does not eliminate it.&lt;/p&gt;

&lt;p&gt;A concrete example: a colleague was working on changes that touched multiple Terragrunt module levels and wanted to structure it as a single PR. The constraint is straightforward: you cannot use outputs from one level in another before applying the first one, so the rule is one PR per level. The fix: use &lt;code&gt;git checkout origin/main -- path_to_file&lt;/code&gt; to revert the level-two file back to main in your current branch, open the first PR, merge and apply it, then create a second PR with the level-two changes. She had already asked Copilot. It had given her something much longer and more complicated.&lt;/p&gt;

&lt;p&gt;The one thing to watch: if the reverted file had significant changes in it, save or stash them before running the checkout, because it will wipe them from your branch.&lt;/p&gt;

&lt;p&gt;A model that produces something technically correct but architecturally wrong is in some ways more dangerous than one that produces something broken, because at least broken things announce themselves.&lt;/p&gt;




&lt;h2&gt;When the model starts looping&lt;/h2&gt;

&lt;p&gt;The other failure mode that shows up regularly is looping. You give the model a problem, it gives you an answer, the answer is wrong, you tell it so, it gives you a variation, that is also wrong, and so on. Anthropic's engineering team describes the same failure in the same terms for longer agentic tasks: on complex work, the agent tends to go off the rails over time, producing increasingly elaborate answers that are no more correct than the first one.&lt;/p&gt;

&lt;p&gt;A good example of this from my own work: I was building a GitLab CI pipeline and wanted to keep it DRY. The &lt;code&gt;changes:&lt;/code&gt; blocks that control when jobs run were being repeated across every rule, so I asked the model to clean it up using YAML anchors. It produced something that looked completely reasonable, valid YAML, clean structure. The pipeline failed. I fed the error back. It adjusted. Still failed. A few iterations in, the suggestions were getting more elaborate but the pipeline kept breaking in the same way.&lt;/p&gt;

&lt;p&gt;The next article in this series gets into exactly why this happens with GitLab specifically, and what the pattern looks like in practice. The signs are pretty recognizable once you have seen it a few times: the answers are getting longer but not more correct, the error is not actually changing between iterations, and you are spending more time explaining the problem than it would take to just fix it.&lt;/p&gt;

&lt;p&gt;The right move at that point is to stop, step back, figure out the root cause yourself, and either fix it directly or come back to the model with a much more specific prompt that contains the missing context. What does not work is staying in the loop, piling on context and hoping the next attempt will break the pattern. Usually it does not.&lt;/p&gt;




&lt;h2&gt;
  
  
  Senior knowledge is more important, not less
&lt;/h2&gt;

&lt;p&gt;In practice I have found the opposite of what the "AI replaces engineers" conversation suggests.&lt;/p&gt;

&lt;p&gt;With LLMs, experienced engineers spend less time writing code and config and more time reviewing output, catching bad patterns and making architectural decisions. The volume of output goes up significantly, which means the demand for quality review goes up with it. And here is the uncomfortable part: a junior engineer can now generate code faster than a senior engineer can critically audit it. The rate-limiting factor that used to keep review meaningful has been removed.&lt;/p&gt;

&lt;p&gt;If you do not have the experience to evaluate what the model produces, you are not doing less work. You are just accumulating a gap between how much exists in your codebase and how much anyone genuinely understands. Addy Osmani calls this &lt;a href="https://addyosmani.com/blog/comprehension-debt/" rel="noopener noreferrer"&gt;comprehension debt&lt;/a&gt;, and the pattern he describes maps closely to what I have been seeing in practice.&lt;/p&gt;

&lt;p&gt;The value of experience has not disappeared. It has moved to a different place in the workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;LLMs struggle most where human expertise matters most: architectural decisions, trade-off reasoning, domain-specific behavior. That is not a reason to avoid using them in those areas. It is a reason to stay engaged as the expert when you do.&lt;/p&gt;

&lt;p&gt;One of the most concrete examples of this I have seen is GitLab pipelines, where the model is technically correct about YAML and completely wrong about GitLab's implementation at the same time. If that sounds familiar, that is exactly what the next article in this series is about.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>architecture</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>LLMs in DevOps: Why They Work Best as a "Very Fast Junior Engineer"</title>
      <dc:creator>Jari Haikonen</dc:creator>
      <pubDate>Thu, 26 Mar 2026 11:37:46 +0000</pubDate>
      <link>https://dev.to/polarsquad/llms-in-devops-why-they-work-best-as-a-very-fast-junior-engineer-59oh</link>
      <guid>https://dev.to/polarsquad/llms-in-devops-why-they-work-best-as-a-very-fast-junior-engineer-59oh</guid>
      <description>&lt;p&gt;I was staring at roughly 10,000 lines of network rules spread across a live cloud environment. Two environments, dev and prod, two regions each, all handled by their own separate configuration files. The task was to cross-check what had already been imported into Terraform and what hadn't, and then split the rules correctly across all those files. That kind of task could easily take weeks to do carefully by hand. With an LLM doing the heavy lifting, I was done in three hours.&lt;/p&gt;

&lt;p&gt;That was the moment the mental model clicked for me.&lt;/p&gt;

&lt;p&gt;And yes, these were network security rules. But here is the thing: in a Terraform import workflow, the tooling itself is the safety net. The goal is a 1:1 match between your IaC and the actual state of the environment. If the AI-generated configuration has any drift from reality, the first &lt;code&gt;terraform plan&lt;/code&gt; after the import tells you immediately. You are not trusting the AI blindly, you are using it to do the repetitive work and then letting Terraform verify the result. That is a very different risk profile from asking an LLM to design your network security from scratch.&lt;/p&gt;
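&lt;p&gt;That verify step can be made explicit with a small wrapper around &lt;code&gt;terraform plan -detailed-exitcode&lt;/code&gt;, which exits 0 when config, state and reality all agree and 2 when there is drift. This is a sketch of the pattern, not the exact tooling from that project, and the import command below uses placeholder names:&lt;/p&gt;

```shell
# Sketch of the per-resource verify loop. `-detailed-exitcode` makes
# `terraform plan` exit 0 for "no changes" and 2 for "drift", so the
# result of an AI-generated import config is machine-checkable.
verify_import() {
  terraform plan -detailed-exitcode -input=false >/dev/null
  case $? in
    0) echo "clean: config matches the live resource" ;;
    2) echo "drift: generated HCL does not match reality" ;;
    *) echo "plan failed: check the error output" ;;
  esac
}

# Typical use per imported resource. Commented out because it needs a real
# Terraform directory and cloud credentials; the address and ID are placeholders:
# terraform import aws_security_group_rule.example "$IMPORT_ID"
# verify_import
```

&lt;p&gt;Run inside the working directory after each import, this turns "does the generated config match reality" into a loop you can script across hundreds of resources.&lt;/p&gt;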

&lt;p&gt;I have been using AI tools as a regular part of my DevOps work for about a year now, not just occasionally but daily, across hobby projects, volunteer work and professional infrastructure and development work. I come at this as a lead DevOps consultant with over 20 years in IT, so I have a pretty good baseline for what good looks like and what bad looks like.&lt;/p&gt;

&lt;p&gt;After a year of this, I have some clear opinions about what these tools are actually good for, where they fall apart, and what way of thinking about them actually helps in practice.&lt;/p&gt;

&lt;p&gt;This is not a model comparison and not a benchmark. Those exist already. This is just what I have noticed from using LLMs in real DevOps work.&lt;/p&gt;




&lt;h2&gt;
  
  
  A year of real use
&lt;/h2&gt;

&lt;p&gt;The work has covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coding in JavaScript, TypeScript, Golang and Java&lt;/li&gt;
&lt;li&gt;IaC with Terraform and Terragrunt&lt;/li&gt;
&lt;li&gt;configuration management with Ansible&lt;/li&gt;
&lt;li&gt;CI/CD on GitLab and GitHub, plus packaging and deployment with Docker Compose and Helm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have tried several models, including Claude (Sonnet, Opus, Haiku), ChatGPT, Gemini and Grok, and different IDEs like VS Code and Cursor. The models do have different strengths and there are clear gaps between them, but I am not going to get into that here. What I want to talk about is what using all of them has taught me about AI-assisted DevOps work in general.&lt;/p&gt;




&lt;h2&gt;
  
  
  The pattern that kept repeating
&lt;/h2&gt;

&lt;p&gt;Across all that work, one pattern kept showing up.&lt;/p&gt;

&lt;p&gt;When I gave the model clear context, a well-scoped task and some constraints to work within, the output was fast and impressively good. When I gave it an open-ended problem or let things run without much correction, the quality dropped quickly. Dead code started accumulating, inconsistent patterns appeared and the model started looping through variations of the same wrong answer.&lt;/p&gt;

&lt;p&gt;The difference was not which model I was using. The difference was how much structure I brought to the interaction.&lt;/p&gt;

&lt;p&gt;And that structure comes directly from your own maturity and experience in the domain. The more you know, the more precisely you can specify what you want, and the better the output gets. This is probably the most underappreciated factor in how well LLMs actually perform in practice.&lt;/p&gt;

&lt;p&gt;Compare these two prompts for the same task:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Create pipeline that deploys my nodejs app"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;versus:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Create CI/CD pipelines for pull requests and deploying on main branch. Add quality gates to the PR pipeline: format, lint, security, build and docker build. In the main pipeline do docker builds and use the registry for cached images to make builds faster. On the Dockerfiles use multi-stage builds where possible to keep the final image small, and make sure we are not running as root. Make the pipelines DRY on the sections that overlap"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The second prompt does not just describe what to build. It reflects years of experience with CI/CD, Docker best practices and security thinking. Someone without that background would not even know to ask for those things. The model cannot supply that knowledge from its own side, it can only work with what you give it.&lt;/p&gt;

&lt;p&gt;It is not that the model is bad. It just has no stake in the outcome and no experience to fall back on. It will produce output either way. The quality of that output depends almost entirely on the quality of the guidance behind it.&lt;/p&gt;




&lt;h2&gt;
  
  
  A very fast junior engineer
&lt;/h2&gt;

&lt;p&gt;The mental model that finally made this click for me: an LLM behaves like a very fast junior engineer.&lt;/p&gt;

&lt;p&gt;A good junior can produce a lot of work quickly and they follow clear instructions well. But they struggle with architectural decisions, tend to go with the most obvious approach rather than the most appropriate one, and need supervision.&lt;/p&gt;

&lt;p&gt;They act this way not because they are useless but because they lack the context and experience to make the right call on their own. Leave them unsupervised long enough and small decisions start to compound into bigger problems.&lt;/p&gt;

&lt;p&gt;LLMs behave exactly like this, just at roughly ten times the speed. The speed is real and genuinely useful, but it does not change the underlying dynamic.&lt;/p&gt;

&lt;p&gt;There is an important flip side to this that is worth saying directly: the analogy only works if you actually are the senior. If you jump into a domain you know nothing about, the dynamic inverts. The model becomes the one with more apparent knowledge and you have no real basis to supervise it. You cannot catch the bad architectural decisions because you do not recognise them. That is when you get the worst outcomes: confident-sounding output that is quietly wrong in ways that take a long time to find and fix.&lt;/p&gt;

&lt;p&gt;When you accept this framing, a few things shift in how you work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your job becomes that of an architect instead of a typist. You define the structure, the constraints, the approach. The model handles the execution.&lt;/li&gt;
&lt;li&gt;Structuring the problem well matters more than prompting well. A well-defined task with clear context will beat a cleverly worded prompt for an undefined problem every time.&lt;/li&gt;
&lt;li&gt;You still need to know your domain. The better you understand the area you are working in, the better you can guide the model and catch its mistakes. Domain expertise is not optional, it is what makes the supervision possible in the first place.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;One thing that has helped quite a lot is writing documentation and conventions that both humans and the model can use. Not AI-specific memory tools or special prompting tricks, but actual documentation that would exist for your team anyway. Things like guidelines in a Terraform modules folder, pipeline conventions, naming rules.&lt;/p&gt;

&lt;p&gt;When that structure exists and you give the model access to it, it follows the established patterns instead of inventing new ones. The corrections get smaller and the output actually fits the system you are building.&lt;/p&gt;

&lt;p&gt;The other thing is knowing when to stop iterating with the model and just fix something yourself. Sometimes two or three rounds of back and forth are not making progress and the model is just looping. At that point the fastest path forward is usually to step in, fix the specific issue yourself, and re-engage the model for the work around it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;AI is already genuinely useful in DevOps workflows. The value you get out of it scales with the quality of the supervision and structure you bring as the engineer. The model is the junior. You are the senior. That dynamic does not disappear as the tools get faster or more capable.&lt;/p&gt;

&lt;p&gt;The rest of this series goes into the details. The next piece looks at where LLMs consistently struggle and why the failures are harder to catch than they look. After that, a concrete GitLab pipeline example that most DevOps engineers will recognise. And then the positive story: why importing existing infrastructure into Terraform is one of the best use cases for LLMs I have found.&lt;/p&gt;

&lt;p&gt;If you have been using LLMs in DevOps or platform engineering work, I am curious what mental model you have settled on. Does the junior engineer analogy match your experience or have you found a better way to think about it?&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>terraform</category>
      <category>platformengineering</category>
    </item>
  </channel>
</rss>
