DEV Community: Polar Squad

Working with Terraform: Where LLMs actually help

Jari Haikonen — Wed, 15 Apr 2026 04:16:33 +0000

Terraform state said one thing. The live environment said something else. The HCL config did not match either. An engineer had been doing the imports across a multi-region AWS Terragrunt project and the work had not been done correctly. Some resources were managed, some were not, and nobody had a reliable picture of which was which. That is the worst starting point for an import: not "nothing done yet" but "something done, unclear what."

Parts of that work involved checking roughly 10,000 lines of network security group rules across all four environments. That is the kind of job that could easily take weeks to do carefully by hand. With an LLM doing the mechanical work inside each step, I got through it in hours.

Two patterns worth knowing

The import work is one side of how LLMs help with Terraform. The other is module scaffolding, and it works through a different mechanism. The underlying principle is the same either way: give the model real context and it does useful work. Give it a vague prompt and you get generic output you have to reshape anyway.

On the import side, there are two distinct scenarios. The first is what I had: a remediation scenario where something was partially done, you do not know the current state clearly, and you need to figure out what is managed, what is not, and what is wrong before making any forward progress. The second is the general case of writing Terraform modules and importing existing resources into them as you go.

Both benefit from the same safety mechanism. Terraform's plan output tells you exactly what it will create, destroy, change, or import. You can paste that output directly to the LLM and say "these resources need to match 1:1 with what is running, here is what Terraform is planning." The tooling itself becomes the feedback loop.

The workflow that made it tractable

The first thing I needed was a complete picture of what actually existed in AWS. I built a small dynamic Python tool that knew how to query each resource type through the AWS API and emit structured text. Thin shell wrappers called that Python once per type and wrote the results to different files per type, fifteen resource types in total. Getting the query syntax, field coverage, and output shape right for every type is exactly the kind of repetitive work where humans make typos and skip fields. I described what I wanted for one resource type and the LLM extended the Python in consistent, correctly-structured ways for the rest. The full inventory toolset was done in under an hour.

With a live inventory in hand, I pulled the Terraform state files to JSON. State as structured JSON is much easier to work with as LLM input than navigating raw remote state. With both sides in structured form, I wrote comparison scripts that identified resources existing in AWS but absent from state or HCL config. For security groups specifically, I built Python tooling that parsed both the live AWS rules and the HCL config, normalized the representations, and reported what was unmanaged in AWS and what was phantom in HCL. The LLM was fast at this: write a parser that understands two different formats, builds a shared logical model, and diffs them. The resulting scripts were clean enough to actually maintain after review. Not production-ready out of the box, but not throwaway code either.

Before importing anything, I prioritized explicitly. I produced a structured priority document using four criteria: how many hardcoded IDs referenced each resource, whether missing ownership would block other work, how complex the import would be, and how frequently the resource changed. The LLM read the raw evidence (grep results from HCL files, state JSON, inventory output) and synthesized it into a ranked first draft, which I reviewed and adjusted. Internet gateways first, then key pairs, CloudTrail, Secrets Manager. Lambda and SageMaker explicitly deferred.

The next decision was structural, and it was mine to make. The security groups were the hardest import problem, with multiple VPCs, dozens of SGs with many rules each, and everything sitting in two monolithic files. I made the call to split each security group into its own file before writing a single import block. Deciding structure before generating code meant the migration script only had to run once.

For the imports I used declarative import {} blocks in HCL. The reason this matters for LLM work: terraform plan shows you exactly which imports are pending, which succeeded, and which have ID mismatches. That output pastes directly into the LLM as a problem statement. The import ID format for SG rules is non-obvious (<sg-id>_<direction>_<protocol>_<from_port>_<to_port>_<source>) and generating it correctly across hundreds of rules is exactly the mechanical transformation the LLM handled well. When the plan shows drift after applying, you say "these resources need to match 1:1 with what is running, here is what Terraform is planning." The model reads the plan, identifies the mismatches, and suggests config changes. This works well for missing attributes, incorrect IDs, and straightforward drift. It is less reliable for complex dependency issues or very large plan output.

Applying the imports was not the last step. I described the verification logic I needed (iterate from live AWS outward, not from HCL inward, to catch resources in both directions) and the LLM wrote the script. Iterating from HCL only confirms that HCL-tracked resources are correct. Iterating from live AWS catches both: rules missing from HCL, and HCL files for resources that no longer exist in AWS. Results: 105 security groups fully clean, 2 rules missing from HCL, 5 stale HCL files for security groups already deleted from AWS.

When documentation does the heavy lifting

The import workflow relies on feeding the LLM real files: state JSON, live inventory, actual HCL. The module writing pattern works the same way, but the input is different: your own documentation.

Once you have written down your module conventions (file layout, variable design rules, which level a resource belongs in, how things wire together) you have something the LLM can actually follow. The output fits your system rather than being generic Terraform you then have to reshape.

I had three documents I fed it. A repository README covering the level-based operating model and what each level owns. We were already using that kind of levels hierarchy on Azure Terraform, and we anchored the README to the same principles described in the Azure Terraform SRE landing zones levels documentation—grouping state by lifecycle and privilege, clear ownership between stacks—so AWS did not get a one-off taxonomy. Parallel structure across clouds made the AWS side easier for the team to reason about and operate. A modules guide covering the standard file layout, the settings object pattern, and the rules around variable schemas and defaults. An architecture document covering how modules are structured and when to use which pattern. Together they covered everything about how modules are built in that codebase.

Feed those three documents, describe the AWS resource you need modeled, ask for a complete module. What comes back follows the correct file layout, uses the settings pattern, defines optional properties with optional() and sensible defaults, exports the expected outputs. None of that required additional prompting. It was in the documentation.

What it still needed from me: the architectural judgment. Does this resource warrant its own module? Which level does it live in? How does it wire into the rest of the codebase? The model follows documented patterns reliably, but it does not reason through those decisions on its own. Where something belongs in the system is still a call that requires understanding the system.

The time saving is specific. No writing the file structure from memory, no looking up whether a property is optional in the AWS provider, no deciding what the default for retention_days or enable_deletion_protection should be. The boilerplate that normally takes 20 minutes took two.

What did not work well

The LLM only worked well when I gave it the actual files. When I did not provide the real state JSON or the real HCL config, it guessed at structure and produced plausible but wrong output. Every step in this workflow involved feeding it real input.

Large monolithic HCL files caused problems. When the entire SG config was in one 600-line file, asking the LLM to modify it directly produced errors. Splitting into per-SG files was the right call for both maintainability and LLM usability.

Config variation across environments tripped the LLM up. Three different Terragrunt config patterns existed across the four environments. The migration script handled two of the three correctly on the first attempt. The third required reading the actual file. The lesson is the same as the first point: the LLM cannot infer variation it has not seen.

The module writing pattern has the same failure mode from the other direction. When your documentation is incomplete or inconsistent, the model fills the gaps with its own defaults rather than yours. If the docs specify the pattern but do not cover what sensible defaults look like for a specific resource type, you get something plausible but not what you would have written. The documentation has to be good enough to actually be useful as context. The model reflects the quality of your documentation back at you.

The takeaway

Across both patterns, imports and module scaffolding, the LLM was not replacing judgment. It was eliminating the mechanical work: extending the shared inventory Python for each new resource type, parsing two incompatible config formats to find differences, generating hundreds of import IDs in the correct format, producing a correctly-structured module with the right variable schema and defaults.

In both cases, the precondition was the same. You supply the real context (actual files for imports, actual documentation for module writing) and you make the structural and architectural decisions yourself. The model handles the repetitive implementation. You verify the result.

That division of labor is clean when you know what you are doing. When you do not, the model still produces output. It just produces confident-looking output you are not equipped to evaluate.

If you want the broader framing behind that, there is an article on why the "very fast junior engineer" mental model is the one that makes the most sense for LLMs in DevOps work — and why domain expertise matters more with these tools, not less.

AI vs reality: Why GitLab pipelines confuse LLMs

Jari Haikonen — Wed, 08 Apr 2026 05:09:11 +0000

The model gave me perfectly valid YAML. The pipeline failed. I asked the model to fix it. It gave me more perfectly valid YAML. The pipeline failed again. After the fourth iteration I just opened the GitLab docs, found the issue in two minutes, and fixed it myself.

This is one of the most common frustrations I have seen with LLMs in DevOps work. The .gitlab-ci.yml file I am working with has 242 commits. 73 of them contain the word "fix". The theme across most of them: the YAML is valid. GitLab disagrees.

GitLab pipelines are not just YAML

GitLab CI/CD pipelines are defined in a .gitlab-ci.yml file, and yes, the format is YAML. But GitLab has its own specific implementation on top of that, with its own keywords, its own scoping rules, and its own runtime semantics. Generic YAML parsers will happily accept a file that GitLab's pipeline linter will reject. And sometimes it does not reject it at all. It just runs differently than you expected.

That gap between "valid YAML" and "valid GitLab pipeline" is where the problems live.

Three ways this plays out in practice

The changes: anchor problem

The pipeline had repeated file path lists in every rule block. The natural LLM suggestion: extract them into YAML anchors and reference them. It produced something that looked completely reasonable:

.frontend-changes: &frontend-changes
  changes:
    - "apps/frontend/**/*"
    - "packages/**/*"
    - ".gitlab-ci.yml"

.mr-frontend:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      <<: *frontend-changes

Valid YAML. Not valid GitLab CI in practice. YAML anchors are resolved before GitLab ever processes the config. GitLab only sees the expanded result. The problem is that merge keys (<<:) do not behave predictably inside nested rule structures. After the merge is applied, the resulting shape may not match what GitLab expects for a rules:changes block, so it either silently falls back to "always run" or evaluates incorrectly depending on context.

The anchors exist in the actual file. The <<: syntax is there. It just does not do what it looks like it does. The comment sitting in the production file:

# Sadly gitlab changes do not support with anchors or references to make these DRY

The extends + silent replacement problem

The next suggestion: use extends: to compose rule templates.

lint-all:
  extends: [.mr-frontend, .mr-backend]

The model expected this to combine both templates' rules, giving you OR logic: the job runs if frontend files changed or if backend files changed. That is not what happens. extends: does not merge arrays — it replaces them, with the last template winning. In the expanded configuration, lint-all ends up with only .mr-backend's rules. The frontend changes: patterns disappear without any warning.

The reason this went undetected: both templates shared most of the same changes: paths — package.json, packages/**/*, .gitlab-ci.yml, and others. The only difference was the last entry: apps/frontend/**/* versus apps/backend/**/*. Most real commits touch shared files, so the job triggered anyway. But a commit that only changes frontend code and nothing else would silently skip the lint job. That bug was live in the pipeline for months.

For explicit control over rule composition, GitLab's !reference tags let you manually assemble the rules: array from multiple sources. But even then, you only get OR logic: GitLab evaluates rules in order and the first matching rule decides the outcome. There is no way to compose AND conditions — run only when a specific branch condition AND specific file changes are both true — across reusable templates. Every combination has to be written out explicitly. The pipeline has seven of these rule templates that a model will always try to collapse into two or three. It cannot be done without changing the semantics.

# These either cant be DRY because the rules are OR and not AND
# (the - & anchors do not work really well)

The 12-minute revert

One commit consolidated a parallel: matrix Docker build into a single sequential job. The reasoning, left in a comment, was exactly the kind of thing a model writes:

# Single job: on main we run stage then latest (same Docker layer cache,
# second build is fast). Separate jobs would duplicate work; one job with
# sequential builds reuses cache.

Logically correct. The Docker layer cache argument is real. The revert came 12 minutes later with no commit message. The problem was not correctness. parallel: matrix gives you separate job entries in the pipeline UI, separate log streams, the ability to retry one variant independently, and separate pass/fail status per build type. Collapsing into one job trades all of that for a real but secondary cache win.

The model optimized for build efficiency, but the system required failure isolation and debuggability. It did not know how you use the GitLab pipeline UI when something breaks at 2am.

The shape of the problem

Every one of these failures produces valid YAML. CI Lint may pass too. It can simulate pipeline creation for the default branch, but it cannot replicate the full runtime context — which branch triggered the pipeline, which files changed in the merge request, which variables are set. The subtle rule evaluation issues only surface when the pipeline actually runs.

What YAML says	What GitLab does
`<<: *anchor` inside `rules:`	Structurally valid, semantically inconsistent
`extends: [A, B]` with `rules:`	Last template silently replaces the first; overlapping patterns hide the bug
`parallel: matrix` removed	Valid, but changes operational behavior, not just output

The LLM generates syntactically correct config. What it cannot do is predict how that config will behave in a system where the outcome depends on evaluation order, repo state, and pipeline context that are not visible in the file itself. You can give it better documentation and it will still get this wrong, because the knowledge it is missing only appears when the pipeline actually runs.

How to recognize you are in a loop

After a few rounds of this you start to recognize the pattern. The error is not actually changing between iterations, the model is adding complexity rather than addressing the root cause, and you are spending more time writing context than it would take to just look up the answer yourself.

Two or three iterations without meaningful progress is the signal. The right move is to stop, go to the GitLab documentation directly, and use GitLab's built-in CI Lint tool to validate the syntax. Find the actual constraint, fix it yourself, and re-engage the model for the work around it. The model is still useful for the majority of pipeline work. It is just not useful for the parts that require knowing GitLab's specific implementation.

The takeaway

Your tooling-specific experience is not optional when things go wrong. LLMs are useful for writing pipeline structure, generating job definitions, and handling the repetitive parts. But when something breaks in a GitLab-specific way, the fastest path forward is usually you, not the model.

And when the pipeline gives you the same error for the fourth time, open the docs.

If you are curious about the flip side of this, there is an article on two Terraform workflows where LLMs genuinely help: importing existing infrastructure and scaffolding modules from your own documentation. That one is a much more satisfying story.

A note on timing: most of the work described here was done close to a year ago, without MCP or similar tool integrations that give models direct access to documentation and live context. Models may handle some of these cases better today. The underlying gap between "valid YAML" and "valid GitLab pipeline" is still real, but your mileage may vary.

When LLMs struggle: Architecture, context, and hidden complexity

Jari Haikonen — Tue, 31 Mar 2026 06:44:35 +0000

The obvious LLM failures are easy to catch. Syntax errors, broken configs, a pipeline that refuses to run. You see the problem immediately and fix it. Those are not the ones that should worry you.

The ones that should worry you are the ones that look completely fine. The code runs. The config is valid. The output looks reasonable. And yet, when someone with more experience takes a look, the problems become obvious immediately. They were just invisible to you.

The knowledge mirror

There is a pattern I noticed pretty quickly when working on tasks outside my main area.

When I work in areas I know well, like Terraform or CI/CD pipelines, I can evaluate the model's output almost automatically. I know what good looks like, I know the common failure patterns, and I catch mistakes fast. The feedback loop is tight.

But when I work on something I know less about, that feedback loop breaks. And the problem is that the model does not help you here at all. It does not become more cautious in unfamiliar territory. It does not tell you when it is guessing. It produces the same confident, well-formatted output regardless of whether it is right or wrong.

Anthropic's engineering team documented the same behavior when building multi-agent coding harnesses. Their published observation: when asked to evaluate their own output, agents reliably respond by confidently praising it, even when the quality is obviously mediocre to a human observer. The confidence is not correlated with the actual quality of the work.

So what you end up with is a mirror: the model reflects your level of knowledge back at you. If you know the domain, you catch the mistakes. If you do not, you miss them. And the less you know, the more you are at the mercy of output you cannot properly evaluate.

In practice, for me this showed up most clearly in development tasks. There are usually several valid ways to implement the same thing in code, and the right choice depends on context, team conventions, performance requirements, and a lot of other things the model cannot know. If you are not familiar enough with those trade-offs yourself, the model will just pick one. And if you do not notice, it gets built on.

The problem with architectural decisions

LLMs are actually quite good at implementing things. Give the model a clear approach and it will execute it well. The problem is when you ask it to choose the approach.

Architectural decisions involve context the model simply does not have: your team's skill level, how much complexity the team can realistically own and maintain at once, which parts of your system are already overengineered, the operational cost of what it is about to build, your future plans. Without that, it defaults to what it has seen most in training data, which is usually the textbook approach or the most complex one, not necessarily the most appropriate one for your situation.

In DevOps this matters more than it might seem, because infrastructure decisions have long tails. A bad pattern in a Terraform module layout, a poorly thought-out pipeline structure, a dependency that should not be there, these things propagate. Fixing them later is far more expensive than catching them early.

There is also the consistency problem. The model has absorbed a lot of old documentation and outdated best practices, and it has no sense of what is current or what fits your context. So it might solve the same kind of problem in two different places in your codebase using two completely different approaches. Both technically valid, but inconsistent in ways that make things harder to maintain over time.

The practical answer is documentation: write down your decisions and conventions and feed them to the model. This genuinely helps. But the model does not always follow the rules you set for it, especially on longer tasks. You still need to review what it produces. Documentation reduces the drift, it does not eliminate it.

A concrete example: a colleague was working on changes that touched multiple Terragrunt module levels and wanted to structure it as a single PR. The constraint is straightforward: you cannot use outputs from one level in another before applying the first one, so the rule is one PR per level. The fix: use git checkout origin/main -- path_to_file to revert the level-two file back to main in your current branch, open the first PR, merge and apply it, then create a second PR with the level-two changes. She had already asked Copilot. It had given her something much longer and more complicated.

The one thing to watch: if the reverted file had significant changes in it, save or stash them before running the checkout, because it will wipe them from your branch.

A model that produces something technically correct but architecturally wrong is in some ways more dangerous than one that produces something broken, because at least broken things announce themselves.

When the model starts looping

The other failure mode that shows up regularly is looping. You give the model a problem, it gives you an answer, the answer is wrong, you tell it so, it gives you a variation, that is also wrong, and so on. Anthropic's engineering team describes the same failure in the same terms for longer agentic tasks: on complex work, the agent tends to go off the rails over time, producing increasingly elaborate answers that are no more correct than the first one.

A good example of this from my own work: I was building a GitLab CI pipeline and wanted to keep it DRY. The changes: blocks that control when jobs run were being repeated across every rule, so I asked the model to clean it up using YAML anchors. It produced something that looked completely reasonable, valid YAML, clean structure. The pipeline failed. I fed the error back. It adjusted. Still failed. A few iterations in, the suggestions were getting more elaborate but the pipeline kept breaking in the same way.

The next article in this series gets into exactly why this happens with GitLab specifically, and what the pattern looks like in practice. The signs are pretty recognizable once you have seen it a few times: the answers are getting longer but not more correct, the error is not actually changing between iterations, and you are spending more time explaining the problem than it would take to just fix it.

The right move at that point is to stop, step back, figure out the root cause yourself, and either fix it directly or come back to the model with a much more specific prompt that contains the missing context. What does not work is adding more context and hoping the next attempt will break the pattern. Usually it does not.

Senior knowledge is more important, not less

In practice I have found the opposite of what the "AI replaces engineers" conversation suggests.

With LLMs, experienced engineers spend less time writing code and config and more time reviewing output, catching bad patterns and making architectural decisions. The volume of output goes up significantly, which means the demand for quality review goes up with it. And here is the uncomfortable part: a junior engineer can now generate code faster than a senior engineer can critically audit it. The rate-limiting factor that used to keep review meaningful has been removed.

If you do not have the experience to evaluate what the model produces, you are not doing less work. You are just accumulating a gap between how much exists in your codebase and how much anyone genuinely understands. Addy Osmani calls this comprehension debt, and the pattern he describes maps closely to what I have been seeing in practice.

The value of experience has not disappeared. It has moved to a different place in the workflow.

The takeaway

LLMs struggle most where human expertise matters most: architectural decisions, trade-off reasoning, domain-specific behavior. That is not a reason to avoid using them in those areas. It is a reason to stay engaged as the expert when you do.

One of the most concrete examples of this I have seen is GitLab pipelines, where the model is technically correct about YAML and completely wrong about GitLab's implementation at the same time. If that sounds familiar, that is exactly what the next article in this series is about.

LLMs in DevOps: Why They Work Best as a "Very Fast Junior Engineer"

Jari Haikonen — Thu, 26 Mar 2026 11:37:46 +0000

I was staring at roughly 10,000 lines of network rules spread across a live cloud environment. Two environments, dev and prod, two regions each, all handled by their own separate configuration files. The task was to cross-check what had already been imported into Terraform and what hadn't, and then split the rules correctly across all those files. That kind of task could easily take weeks to do carefully by hand. With an LLM doing the heavy lifting, I was done in three hours.

That was the moment the mental model clicked for me.

And yes, these were network security rules. But here is the thing: in a Terraform import workflow, the tooling itself is the safety net. The goal is a 1:1 match between your IaC and the actual state of the environment. If the AI-generated configuration has any drift from reality, Terraform tells you immediately when you run the import. You are not trusting the AI blindly, you are using it to do the repetitive work and then letting Terraform verify the result. That is a very different risk profile from asking an LLM to design your network security from scratch.

I have been using AI tools as a regular part of my DevOps work for about a year now, not just occasionally but daily, across hobby projects, volunteer work and professional infrastructure and development work. I come at this as a lead DevOps consultant with over 20 years in IT, so I have a pretty good baseline for what good looks like and what bad looks like.

After a year of this, I have some clear opinions about what these tools are actually good for, where they fall apart, and what way of thinking about them actually helps in practice.

This is not a model comparison and not a benchmark. Those exist already. This is just what I have noticed from using LLMs in real DevOps work.

A year of real use

The work has covered:

coding in JavaScript, TypeScript, Golang and Java
IaC with Terraform and Terragrunt
configuration management with Ansible
CI/CD work on GitLab, GitHub, Docker Compose and Helm

I have tried several models, including Claude (Sonnet, Opus, Haiku), ChatGPT, Gemini and Grok, and different IDEs like VS Code and Cursor. The models do have different strengths and there are clear gaps between them, but I am not going to get into that here. What I want to talk about is what using all of them has taught me about AI-assisted DevOps work in general.

The pattern that kept repeating

Across all that work, one pattern kept showing up.

When I gave the model clear context, a well-scoped task and some constraints to work within, the output was fast and impressively good. When I gave it an open-ended problem or let things run without much correction, the quality dropped quickly. Dead code started accumulating, inconsistent patterns appeared and the model started looping through variations of the same wrong answer.

The difference was not which model I was using. The difference was how much structure I brought to the interaction.

And that structure comes directly from your own maturity and experience in the domain. The more you know, the more precisely you can specify what you want, and the better the output gets. This is probably the most underappreciated factor in how well LLMs actually perform in practice.

Compare these two prompts for the same task:

"Create pipeline that deploys my nodejs app"

versus:

"Create CI/CD pipelines for pull requests and deploying on main branch. Add quality gates to the PR pipeline: format, lint, security, build and docker build. In the main pipeline do docker builds and use the registry for cached images to make builds faster. On the Dockerfiles use multi-stage builds where possible to keep the final image small, and make sure we are not running as root. Make the pipelines DRY on the sections that overlap"

The second prompt does not just describe what to build. It reflects years of experience with CI/CD, Docker best practices and security thinking. Someone without that background would not even know to ask for those things. The model cannot supply that knowledge from its own side, it can only work with what you give it.

It is not that the model is bad. It just has no stake in the outcome and no experience to fall back on. It will produce output either way. The quality of that output depends almost entirely on the quality of the guidance behind it.

A very fast junior engineer

The mental model that finally made this click for me: an LLM behaves like a very fast junior engineer.

A good junior can produce a lot of work quickly and they follow clear instructions well. But they struggle with architectural decisions, tend to go with the most obvious approach rather than the most appropriate one, and need supervision.

They act this way not because they are useless but because they lack the context and experience to make the right call on their own. Leave them unsupervised long enough and small decisions start to compound into bigger problems.

LLMs behave exactly like this, just at roughly ten times the speed. The speed is real and genuinely useful, but it does not change the underlying dynamic.

There is an important flip side to this that is worth saying directly: the analogy only works if you actually are the senior. If you jump into a domain you know nothing about, the dynamic inverts. The model becomes the one with more apparent knowledge and you have no real basis to supervise it. You cannot catch the bad architectural decisions because you do not recognise them. That is when you get the worst outcomes: confident-sounding output that is quietly wrong in ways that take a long time to find and fix.

When you accept this framing, a few things shift in how you work:

Your job becomes that of an architect instead of a typist. You define the structure, the constraints, the approach. The model handles the execution.
Structuring the problem well matters more than prompting well. A well-defined task with clear context will beat a cleverly worded prompt for an undefined problem every time.
You still need to know your domain. The better you understand the area you are working in, the better you can guide the model and catch its mistakes. Domain expertise is not optional, it is what makes the supervision possible in the first place.

What this looks like in practice

One thing that has helped quite a lot is writing documentation and conventions that both humans and the model can use. Not AI-specific memory tools or special prompting tricks, but actual documentation that would exist for your team anyway. Things like guidelines in a Terraform modules folder, pipeline conventions, naming rules.

When that structure exists and you give the model access to it, it follows the established patterns instead of inventing new ones. The corrections get smaller and the output actually fits the system you are building.

The other thing is knowing when to stop iterating with the model and just fix something yourself. Sometimes two or three rounds of back and forth are not making progress and the model is just looping. At that point the fastest path forward is usually to step in, fix the specific issue yourself, and re-engage the model for the work around it.

The takeaway

AI is already genuinely useful in DevOps workflows. The value you get out of it scales with the quality of the supervision and structure you bring as the engineer. The model is the junior. You are the senior. That dynamic does not disappear as the tools get faster or more capable.

The rest of this series goes into the details. The next piece looks at where LLMs consistently struggle and why the failures are harder to catch than they look. After that, a concrete GitLab pipeline example that most DevOps engineers will recognise. And then the positive story: why importing existing infrastructure into Terraform is one of the best use cases for LLMs I have found.

If you have been using LLMs in DevOps or platform engineering work, I am curious what mental model you have settled on. Does the junior engineer analogy match your experience or have you found a better way to think about it?

AI/ML Platforms: Pros and Cons

Jani Ranta — Mon, 28 Oct 2024 11:33:47 +0000

When choosing between AI/ML platforms, each provider has its strengths and trade-offs:

Azure Machine Learning excels in seamless integration with Microsoft services, strong security features, and ease of use through AutoML tools, but can become costly with complex pricing and requires expertise within the Azure ecosystem.
AWS SageMaker offers extensive flexibility, scalability, and strong AWS service integration, making it ideal for large-scale applications, though it can be challenging for beginners due to its steep learning curve and potentially high costs.
Google Vertex AI is known for its user-friendly interface and superior AutoML capabilities, particularly for data-heavy operations, but has fewer pre-built models compared to AWS and can be expensive with larger datasets.
On the other hand, open-source solutions like Hugging Face offer cost savings and high customization, but demand significant technical expertise and manual setup, making them harder to scale without the right resources.

Machine Learning and AI Model Development

Category	Microsoft Azure	Amazon Web Services	Google Cloud Platform	Open Source
Machine Learning and AI Model Development	Azure Machine Learning	Amazon SageMaker	Vertex AI	MLflow, Kubeflow, Ray Serve
Image Recognition and Computer Vision	Computer Vision	Amazon Rekognition	Vision AI	DeepStack AI Server
Natural Language Processing (NLP)	Text Analytics, Language Understanding (LUIS)	Amazon Comprehend	Natural Language API	Haystack, Rasa, spaCy REST API, Hugging Face
Speech Recognition and Text-to-Speech	Azure AI Speech Service	Amazon Transcribe, Amazon Polly	Speech-to-Text, Text-to-Speech	Vosk (for Speech Recognition), Coqui TTS (for Text-to-Speech)
Chatbots and Interactive Applications	Azure Bot Services, Microsoft Copilot Studio	Amazon Lex	Dialogflow	Rasa, Botpress
Automated Text Processing and Analysis	Azure AI Document Intelligence	Amazon Comprehend	Document AI	Tesseract OCR, Apache Tika
Generative Large Language Models (LLMs)	Azure OpenAI Service	Amazon Bedrock	Vertex AI, Gemini	DeepSpeed, Haystack, Hugging Face Transformers, GPT-J/NeoX Playground, Ollama

Azure Machine Learning: A cloud-based service designed for the entire machine learning lifecycle, enabling data scientists and engineers to build, train, and deploy models at scale. It supports various frameworks and offers tools for MLOps, data preparation, and model management.

Amazon SageMaker: A fully managed service that provides tools for building, training, and deploying machine learning models quickly. It includes features like built-in algorithms, Jupyter notebooks, and model monitoring capabilities

Vertex AI: A unified platform that simplifies the machine learning workflow by integrating various tools and services for data preparation, model training, and deployment. It supports AutoML and custom training with TensorFlow and PyTorch.

MLflow: An open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment. It provides a tracking server, projects, and a model registry.

Kubeflow: An open-source machine learning toolkit for Kubernetes, designed to facilitate the deployment, orchestration, and management of machine learning workflows on Kubernetes clusters.

Ray Serve: A scalable model serving library that allows users to deploy machine learning models in production with minimal latency. It integrates seamlessly with Ray, a distributed computing framework.

Image Recognition and Computer Vision

Computer Vision: A service that provides algorithms for image analysis, including object detection, image classification, and optical character recognition (OCR) capabilities.

Amazon Rekognition: A service that makes it easy to add image and video analysis to applications, offering features like facial recognition and object detection.

Vision AI: A Google Cloud service that provides powerful image analysis capabilities through pre-trained models and custom model training

DeepStack AI Server: An open-source platform for implementing AI capabilities in applications, including image recognition and face detection.

Natural Language Processing (NLP)

Text Analytics, Language Understanding (LUIS): Azure services that provide capabilities for sentiment analysis, key phrase extraction, and language understanding for building conversational applications.

Amazon Comprehend: A natural language processing service that uses machine learning to find insights and relationships in text, such as sentiment and entity recognition.

Natural Language API: A Google Cloud service that allows developers to analyze and understand text through features like entity recognition and sentiment analysis.

Haystack, Rasa, spaCy REST API, Hugging Face: Open-source frameworks and libraries for building NLP applications, offering capabilities for intent recognition, dialogue management, and text processing.

Speech Recognition and Text-to-Speech

Azure AI Speech Service: A service that provides speech recognition and text-to-speech capabilities, enabling applications to convert speech to text and vice versa.

Amazon Transcribe, Amazon Polly: Services for automatic speech recognition and text-to-speech, allowing developers to add voice capabilities to applications easily.

Speech-to-Text, Text-to-Speech: Google Cloud services that enable audio transcription and speech synthesis, providing high-quality voice outputs for applications.

Vosk, Coqui TTS: Open-source tools for speech recognition and text-to-speech, allowing developers to integrate voice capabilities into their applications without relying on cloud services.

Chatbots and Interactive Applications

Azure Bot Services: A platform for building and deploying intelligent chatbots that can interact with users across various channels. With Microsoft Copilot Studio on Azure this service becomes very powerful.

Amazon Lex: A service for building conversational interfaces using voice and text, powered by the same technology as Alexa.

Dialogflow: A Google Cloud service for building conversational agents, offering natural language understanding and integration with various messaging platforms.

Rasa, Botpress: Open-source frameworks for developing conversational AI applications, providing tools for building, training, and deploying chatbots.

Automated Text Processing and Analysis

Azure AI Document Intelligence: A service that helps automate the extraction of information from documents, enhancing data processing workflows.

Amazon Comprehend: Also mentioned under NLP, it provides capabilities for analyzing text and extracting insights from documents.

Document AI: A Google Cloud service that automates the extraction of structured data from unstructured documents, improving data processing efficiency.

Tesseract OCR, Apache Tika: Open-source tools for optical character recognition and document parsing, enabling automated text extraction from images and documents.

Last but not least, Generative Large Language Models (LLMs)

AWS SageMaker: A fully managed machine learning (ML) service. Data scientists and developers can quickly and confidently build, train, and deploy ML models into a production-ready hosted environment. It provides a UI experience for running ML workflows that makes SageMaker ML tools available across multiple integrated development environments (IDEs).

Azure OpenAI Service: A service that provides access to OpenAI's powerful language models, enabling developers to build applications that require natural language understanding and generation.

Amazon Bedrock: A managed service that allows developers to build and scale generative AI applications using foundation models from various providers.

Vertex AI, Gemini: Google Cloud services that facilitate the development of generative AI applications, offering access to advanced language models for various use cases.

DeepSpeed, Haystack, Hugging Face Transformers, GPT-J/NeoX Playground, Ollama: Open-source tools and frameworks for building and deploying generative AI applications, providing capabilities for training and fine-tuning large language models.

Feature/Capability	Azure Machine Learning	AWS SageMaker	Google Vertex AI	Open Source Solutions
Management Type	Fully managed	Fully managed	Fully managed	Self-managed
Ease of Use	High, with visual tools and AutoML	Moderate, with no-code options available	High, with integrated tools	Varies by tool
Model Training	Supports various frameworks, AutoML	Supports various frameworks, AutoML	Supports various frameworks, AutoML	MLflow, Kubeflow, Ray Serve
Model Deployment	Easy endpoint configuration	Easy endpoint configuration	Easy endpoint configuration	Requires manual setup
Pre-built Models	Yes, through Azure Model Gallery	Yes, through SageMaker Model Zoo	Yes, through Model Garden	Depends on the specific tool
Integration with Other Services	Strong integration with Azure services	Strong integration with AWS services	Strong integration with Google services	Varies by tool
Generative AI Support	Yes, through Azure OpenAI Service	Yes, through SageMaker	Yes, through GenAI	DeepSpeed, Hugging Face Transformers
NLP Capabilities	Comprehensive (Text Analytics, LUIS)	Comprehensive (Comprehend)	Comprehensive (Natural Language API)	Haystack, Rasa, spaCy REST API
Computer Vision Capabilities	Yes, through Computer Vision	Yes, through Amazon Rekognition	Yes, through Vision AI	DeepStack AI Server
Speech Recognition	Yes, through Azure AI Speech Service	Yes, through Amazon Transcribe	Yes, through Speech-to-Text	Vosk
Text-to-Speech	Yes, through Azure AI Speech Service	Yes, through Amazon Polly	Yes, through Text-to-Speech	Coqui TTS
Collaboration Tools	Azure ML Workspaces	SageMaker Studio for team collaboration	Vertex AI Workbench	Varies by tool
Cost Structure	Pay-as-you-go, pricing varies by usage	Pay-as-you-go, pricing varies by usage	Pay-as-you-go, pricing varies by usage	Free, but requires infrastructure
Customization	High customization options available	High customization options available	High customization options available	High, depending on the framework

How to use AWS Roles Anywhere

Janne Pohjolainen — Wed, 21 Feb 2024 09:27:20 +0000

What is AWS Roles Anywhere?

AWS Roles Anywhere enables you to use AWS Policies and AWS Roles for workloads such as servers, containers and applications that are running outside AWS without having to create long-term credentials.

AWS Roles Anywhere uses SSL certificates to manage workload access through IAM Roles and handing over short-living session keys.

How does AWS Roles Anywhere work?

A workload needs to have an SSL certificate (and a private key) that is signed by a Certificate Authority configured in AWS Roles Anywhere as a Trust Anchor. It is possible to configure external CAs or to use an AWS Private CA.

AWS Roles Anywhere Profiles map which IAM Roles are possible to be assumed by the workload. The workload is then given a short-living session keys which grants access to AWS services according to the role policies.

This is more secure than if user credentials or other long-term credentials are used, as they usually have much more access than a single role. It's also possible to set boundaries for the roles added to the Profile which override the role's policies.

What use-cases does AWS Roles Anywhere have?

These roles can be used in any kind of workload running outside AWS that needs access to the resources in the cloud. For example,

On-premises databases servers could connect to S3 to fetch data.
If you use Kubernetes in a private cloud, you can create a role that gives the application container running in the cluster permissions to specific AWS resources only.
If you already have an existing CA and PKI (Public-Key Infrastructure) system, another use case is to use AWS Roles Anywhere to use the existing certificates to grant access to AWS services.

Let's dive in deeper...

Some infrastructure is needed first

First, a Certificate Authority (CA) is needed. For this example, we will create an AWS Private CA and then create an AWS Roles Anywhere trust anchor for it.
After that, we will create an IAM Role with S3 read-only access and a Roles Anywhere Profile that links the IAM role to the trust anchor and CA.

We will not dive deep into these and concentrate more on the workload side. Terraform code used to create CA, trust anchor, profile, role and an S3 bucket for this example can be found in https://github.com/jpohjolainen/aws_roles_anywhere

Once run, Terraform will output the ARNs of the newly created CA, trust anchor, role and profile, and also a S3 bucket name.

These will be needed later on for creating a certificate for signing in to AWS and when revoking a certificate.

Speaking of certificates, let's create a new certificate request and get a certificate from the CA.

Note: Only the first private CA will be free for 30 days. If you create one and then delete it and create another, you will need to pay for the CA immediately. A private CA costs 300€ ($400) a month.

Note 2: It is also possible to create your own CA with an OpenSSL command and configure it using external CA. Here is how to do that: https://aws.amazon.com/blogs/security/iam-roles-anywhere-with-an-external-certificate-authority/

Creating a certificate and private key for our application

Create Certificate Signing Request (CSR) and a new Private Key.

Create CSR app-cert.csr and private key app-private-key. Change the Subject to your liking. C=Country, ST=State, OU=Organization Unit, O=Organization and CN=Common Name (name of app, or hostname/domain)

openssl req -new -newkey rsa:2048 \
 -out "app-cert.csr" \
 -keyout "app-private.key" \
 -subj "/C=DE/ST=Berlin/OU=DevOps/CN=app1"

The previous command prompted for password for the private key, but we need to remove it as this is for an application

openssl rsa -in "app-private.key" -out "app-private-nopass.key"

Request a certificate from the CA based on the CSR.

aws acm-pca issue-certificate \
 --certificate-authority "arn:aws:acm-pca:eu-west-1:xxxxxx:certificate-authority/zzzzzzzzz" \
 --csr "fileb://app-cert.csr" \
 --signing-algorithm "SHA256WITHRSA" \
 --validity Value=365,Type="DAYS"

The --csr option really has fileb:// instead of file://. It is used to read files in binary format in the AWS CLI.

Note: The following signing algorithms are available: SHA256WITHECDSA, SHA384WITHECDSA, SHA512WITHECDSA, SHA256WITHRSA, SHA384WITHRSA and SHA512WITHRSA.
The specified signing algorithm family (RSA or ECDSA) must match the algorithm family of the CA's secret key.

The validity type can be YEARS, MONTHS, DAYS and the number as the value, END_DATE with value of YYYYMMDDHHMMSS or ABSOLUTE with a unix timestamp as the value.

Get the CertificateARN from the reply. It is used in next section.

Download the issued certificate.

Use the AWS CLI to download the certificate from the private CA

aws acm-pca get-certificate \
 --certificate-arn "arn:aws:acm-pca:eu-west-1:xxxxxx:certificate-authority/zzzzzzzzz/certificate/af7d3bf5c562a7d91f9310da8ae6ea8d" \
 --certificate-authority-arn "arn:aws:acm-pca:eu-west-1:xxxxxx:certificate-authority/zzzzzzzzz" \
 --output json \
 |jq -r '.Certificate, .CertificateChain' > app-cert.pem

Note: This uses jq to get the certificate in JSON format. It's possible to use --output text and direct that to a file, but then you need to edit the file to move the second -----BEGIN CERTIFICATE----- to its own line.

Show the certificate

openssl x509 -in app-cert.pem -noout -text

Creating a container with an application

Once we have the certificate and the private key created, we need to have a configuration for AWS.
$HOME/.aws/config is used by AWS SDK and CLI for getting access to AWS by signing in with the certificate and assume the role.

AWS provides a tool to help with the sign-in called AWS Signing Helper. In this example, it is downloaded inside the Dockerfile when building the image.
https://docs.aws.amazon.com/rolesanywhere/latest/userguide/credential-helper.html

.aws/config

The helper tool can then be used in the file $HOME/.aws/config to login to the AWS when SDK or CLI is used. Here we need the ARNs that the Terraform code above returns. Save this to a file called aws-config. The Dockerfile expects this name:

[default]
    credential_process = /usr/local/bin/aws_signing_helper credential-process --certificate /app/app-cert.pem --private-key /app/app-private-nopass.key --trust-anchor-arn arn:aws:rolesanywhere:eu-west-1:xxxxxx:trust-anchor/yyyyyyyy --profile-arn arn:aws:rolesanywhere:eu-west-1:xxxxxx:profile/ccccccc --role-arn arn:aws:iam::xxxxxx:role/RolesAnywhere

Note: Change the --trust-anchor-arn, --profile-arn and --role-arn to values gotten from Terraform code.

Small application

Here is a small Python code to print S3 buckets. Save this to a file gets3buckets.py

import boto3
import time

def hello_s3():
    s3_resource = boto3.resource("s3")
    print("Hello, Amazon S3! Let's list your buckets:")
    for bucket in s3_resource.buckets.all():
        print(f"\t{bucket.name}")


if __name__ == "__main__":
    while True:
        hello_s3()
        time.sleep(5)

Copy the certificate, private key, aws-config and the above python code into a directory.

Dockerfile

Create a Dockerfile in the same directory as the certificates and the other files previously created:

FROM debian:stable-slim

ARG homedir=/app

RUN DEBIAN_FRONTEND=noninteractive apt-get update \
    && apt-get upgrade \
    && apt-get install --no-install-recommends -y \
        awscli \
        curl \
        python3-boto3 \
    && rm -rf /var/lib/apt/lists/*

# Download AWS Signing Helper
RUN cd /usr/local/bin \
    && curl -LO https://rolesanywhere.amazonaws.com/releases/1.1.1/X86_64/Linux/aws_signing_helper \
    && chmod 0755 aws_signing_helper

# Create user to run the app
RUN adduser --system --home "$homedir" --no-create-home --shell /bin/false userapp
RUN mkdir "$homedir" && chown userapp "$homedir"

# After this everything is run under the user
USER userapp

COPY --chown=userapp --chmod=0600 ./app-cert.pem "$homedir"
# This should never be copied inside the image. It should be mounted from outside
COPY --chown=userapp --chmod=0600 ./app-private-nopass.key "$homedir"


RUN mkdir "$homedir"/.aws \
    && chmod 0700 "$homedir"/.aws

# Copy the aws-config 
COPY --chown=userapp --chmod=0644 ./aws-config "$homedir"/.aws/config

COPY --chown=userapp --chmod=0755 gets3buckets.py "$homedir"

WORKDIR "$homedir"

CMD python3 /app/gets3buckets.py

Note: The private key should never be baked into the container except when testing locally. The private key should be in a secret store like AWS Secrets Manager and then mounted from there when using inside Kubernetes.

Build the container

docker build -t s3rolesanywhere:test .

When running the container, it will use the certificate and the private key with the aws_signing_helper tool to request AWS secrets and session keys, and assume the role. It will then print the buckets from S3 every 5 seconds.

docker run -ti --rm s3rolesanywhere:test

The output should be something like this:

Hello, Amazon S3! Let's list your buckets:
    private-ca-crl-xxxxxxxx
    ...

Revoking certificate

Get certificate serial number

To revoke a certificate, you need to have its Serial number. You can get it from the certificate with openssl command

openssl x509 -in app-cert.pem -noout -serial

Revoke certificate in AWS Private CA

Then, you can revoke the certficate in AWS Private CA with the following command

aws acm-pca revoke-certificate \
 --certificate-authority-arn <ARN of CA> \
 --certificate-serial <Serial of the cert to revoke> \
 --revocation-reason "<Reason for revoking>"

Note: Only these are valid values for --revocation-reason: AFFILIATION_CHANGED, CESSATION_OF_OPERATION, A_A_COMPROMISE, PRIVILEGE_WITHDRAWN, SUPERSEDED, UNSPECIFIED, KEY_COMPROMISE, CERTIFICATE_AUTHORITY_COMPROMISE

It may take some time for the CRL file to appear.

Download AWS Private CA CRL file

You can then download the CRL from the S3 bucket configured for CRL in the AWS Private CA.

List CRL files in the Private CA S3 Bucket (private_ca_s3_bucket is the output from Terraform)

aws s3 ls s3://<private_ca_s3_bucket>/crl

Get the CRL file from the S3 bucket

aws s3 cp s3://<private_ca_s3_bucket>/crl/xxxxxxxxx.crl .

This file is in DER format, and it needs to be in PEM format for AWS Roles Anywhere to accept.

openssl crl -inform DER -in xxxxxxxx.crl -outform PEM -out privateca.crl

Import CRL to AWS Roles Anywhere

A CRL file in PEM format needs to be then uploaded/imported to the AWS Roles Anywhere for it to revoke the access to the certificate

aws rolesanywhere import-crl \
 --crl-data privateca.crl \
 --name <give some name> \
 --trust-anchor-arn <ARN of the Turst Anchor> --enabled

Note: You can only import 2 CRLs. Even with same names it doesn't overwrite them, so you need to manually remove them. Check aws rolesanywhere list-crls and aws rolesanywhere delete-crl AWS CLI commands.

After some time, the access with that certificate should not work anymore. The error will be something like

botocore.exceptions.CredentialRetrievalError: Error when 
retrieving credentials from custom-process: 
2024/02/16 11:11:34 AccessDeniedException: Certificate revoked

Conclusion

It is a pretty neat way of giving AWS access to applications and servers outside of AWS. It is more secure than creating normal, long-term user credentials and using them for applications. Compared to normal user credentials, if the session is hijacked in transit, then it is only valid for a short while instead of potentially being live for even months or years. And if the private key is compromised, then it can only access the services configured with policies to the role.

But revoking certificates is not all that straightforward. The CA keeps up a list called Certificate Revocation List (CRL). This needs to be imported to AWS Roles Anywhere seperately. AWS Roles Anywhere doesn’t have any automated update system of the CRL, even with AWS Private CA that can push the CRL to S3. It can only be imported through AWS CLI or API, but for example Terraform does not yet have that capability.

I feel that unless a company is already heavily invested in certificates and PKI systems, the AWS Roles Anywhere brings complications not needed on top of all other user management, especially as Terraform doesn't support import-crl now, and playing with AWS Private CA or OpenSSL requires more manual work than I think is necessary.

Crossplane: How do providers work

Joonas Venäläinen — Tue, 17 Oct 2023 07:16:35 +0000

Providers are the meat around Crossplane’s bones, and they are used to extend the capabilities of Crossplane. When Crossplane is installed, it doesn't have any capabilities to interact with external systems. A core Crossplane pod will only watch the following resources.

compositeresourcedefinitions.apiextensions.crossplane.io
compositionrevisions.apiextensions.crossplane.io         
compositions.apiextensions.crossplane.io                 
configurationrevisions.pkg.crossplane.io                 
configurations.pkg.crossplane.io                         
controllerconfigs.pkg.crossplane.io                      
locks.pkg.crossplane.io                                  
providerrevisions.pkg.crossplane.io                      
providers.pkg.crossplane.io                              
storeconfigs.secrets.crossplane.io

When you install a provider, a new pod is created to Crossplane's installation namespace. This pod is a Kubernetes Controller that watches the CRDs that are also installed as part of the provider package.

To find out what different kinds of providers are available, you can check the Upbound Marketplace and crossplane-contrib repository. For this series, we are going to work with the following providers:

Those GCP providers are installed from the provider-family-gcp package. These provider-family packages are special packages that allow you to install only the provider packages you need instead of installing everything, which would mean 343 CRDs if you install the provider-gcp package instead. Crossplane also states:

On average, 30 CRDs are used from Provider packages.

Looking at the average number, you would still have ~313 CRDs in the cluster that aren't used 🤯.

Install the providers

cat <<EOF | kubectl apply --filename=-
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-gcp-storage
spec:
  package: xpkg.upbound.io/upbound/provider-gcp-storage:v0.36.0
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-gcp-cloudplatform
spec:
  package: xpkg.upbound.io/upbound/provider-gcp-cloudplatform:v0.36.0
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-terraform
spec:
  package: xpkg.upbound.io/upbound/provider-terraform:v0.10.0
EOF

After a little while, you should see the providers installed and in a healthy state

kubectl get provider
---
NAME                          INSTALLED   HEALTHY   PACKAGE                                                      AGE
provider-gcp-cloudplatform    True        True      xpkg.upbound.io/upbound/provider-gcp-cloudplatform:v0.36.0   116s
provider-gcp-storage          True        True      xpkg.upbound.io/upbound/provider-gcp-storage:v0.36.0         116s
provider-terraform            True        True      xpkg.upbound.io/upbound/provider-terraform:v0.10.0           116s
upbound-provider-family-gcp   True        True      xpkg.upbound.io/upbound/provider-family-gcp:v0.37.0          107s

Now the providers are installed and ready, we need to set up the ProviderConfig, which configures the credentials for the provider to be able to interact with external systems, in this case, with Google Cloud. You can have multiple ProviderConfigs and reference them in managed resources using providerConfigRef. ProviderConfigs are cluster-scoped resources.

You can set up a ProviderConfig per tenant when you have a multi-tenant cluster. When creating compositions, you could patch the value of providerConfigRef in managed resources with a value of spec.claimRef.namespace, which points to the namespace where the XRC was created.

Every provider has their own individual settings available when it comes to ProviderConfig. For the GCP provider, you can find all the available configuration options here and for Terraform provider here. If you need to override Controller related settings eg. ServiceAccount you can use ControllerConfig for that.

In upcoming chapters, we will create resources in Google Cloud that involve creating a bucket, serviceaccount, iam-binding, and serviceaccountkey. Use the following to configure a new service account with needed permissions in GCP:

# GCP Project ID
PROJECT_ID=""

gcloud iam service-accounts create crossplane-sa-demo --display-name "Crossplane Service Account Demo"

gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:crossplane-sa-demo@$PROJECT_ID.iam.gserviceaccount.com --role roles/storage.admin
gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:crossplane-sa-demo@$PROJECT_ID.iam.gserviceaccount.com --role roles/iam.serviceAccountAdmin
gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:crossplane-sa-demo@$PROJECT_ID.iam.gserviceaccount.com --role roles/iam.serviceAccountKeyAdmin
gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:crossplane-sa-demo@$PROJECT_ID.iam.gserviceaccount.com --role roles/storage.iamMember

Create a service account key:

gcloud iam service-accounts keys create credentials.json --iam-account=crossplane-sa-demo@$PROJECT_ID.iam.gserviceaccount.com

Create a Kubernetes secret in crossplane-system namespace that contains the previously created credentials:

kubectl create secret generic gcp-creds --from-file=creds=./credentials.json -n crossplane-system

Create ProviderConfig that uses these credentials:

cat <<EOF | kubectl apply --filename=-
apiVersion: gcp.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  projectID: $PROJECT_ID
  credentials:
    source: Secret
    secretRef:
    name: gcp-creds
    namespace: crossplane-system
    key: creds
EOF

If you run this inside GKE, using the Workload Identity for authentication is much better. You can find detailed instructions for it here.

You can also read the secret from the filesystem using fs. This might come in handy in cases where you are leveraging, for example, Hashicorp Vault with Vault Agent sidecar to inject secrets to pods. Here is a quick example of how you would configure it without going into too much detail about how to work with Vault Agent Injector:

apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: gcp-config
spec:
  metadata:
    annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "crossplane-providers"
    ...
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-gcp-storage
spec:
  package: xpkg.upbound.io/upbound/provider-gcp-storage:v0.36.0
  controllerConfigRef:
    name: gcp-config
---
apiVersion: gcp.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  projectID: $PROJECT_ID
  credentials:
    source: Filesystem
    fs:
      path: /vault/secrets/gcp-creds

Now we can quickly test that everything is working by creating a Bucket resource:

cat <<EOF | kubectl apply --filename=-
apiVersion: storage.gcp.upbound.io/v1beta1
kind: Bucket
metadata:
  name: ps-bucket-${RANDOM}
spec:
  forProvider:
    location: US
EOF

After a little while, you should see the bucket resource ready and synced:

kubectl get bucket
---
NAME                            READY   SYNCED   EXTERNAL-NAME                  AGE
bucket-crossplane-demo-30855    True    True     bucket-crossplane-demo-30855   15m

At this point, we are ready to start working with GCP using Crossplane. I will go through setting up the Terraform provider configs later in the series when it's time to start working with it.

Remember to delete the test bucket resource:

kubectl delete bucket <bucket_name>

The next chapter quickly reviews available configuration options for managed resources.

Crossplane: Streamline your infrastructure provisioning & management

Joonas Venäläinen — Tue, 17 Oct 2023 07:16:15 +0000

Crossplane is an extension to Kubernetes, which transforms your Kubernetes cluster into a universal control plane. It allows you to manage anything that has an API available with the help of provider packages. It's also fully extendable, so you can build your own providers to support your APIs. Everybody likes 🍕 so here is a post providers-101-ordering-pizza-with-kubernetes-and-crossplane which goes through on a high level how to build a provider that is capable of ordering pizza for you inside Kubernetes cluster.

Crossplane is often used to interact with cloud providers like Azure, AWS, and GCP. By using the Crossplane, you can bring your infrastructure management to Kubernetes. Another significant benefit of Crossplane is that it acts as a Kubernetes Controller, constantly monitoring the state of external resources. So, if someone would modify/delete the resources outside of Kubernetes, Crossplane would reconcile the resources.

By connecting Kubernetes and the cloud provider APIs with Kubernetes Custom Resource Definitions (CRDs), Crossplane can be used to enable a Kubernetes-native approach to managing cloud resources. We can for example define a Database custom resource which will then provision a database in the cloud provider that we have hooked Crossplane into. This makes it easier for developers to start consuming infrastructure resources by hiding the complexity behind Crossplane compositions. The Ops team can create and maintain these compositions and the XRDs which define how the resource will look like for the consumers. Without going into too much detail, we could have a custom resource Database that the developers can consume.

apiVersion: storage.polarsquad.com/v1alpha1
kind: Database
metadata:
  name: ps-demo-db-mysql
spec:
  parameters:
    name: ps-demo-db-mysql
    size: "small"

Here, we define the size as small. The Ops team could set small, medium, and large options for the size. Then, in compositions, patch these values to specific instance types depending on the cloud platform. This also enhances the experience for the developers as they don't have to know the specific instance types.

As the resources are native Kubernetes manifests, you can bundle them with the other manifests you use to deploy the application to the cluster. When the resources are created, Crossplane will create the connection secrets to the application namespace for pods to consume.

Overall, the Crossplane allows you to form your own Cloud Platform inside Kubernetes. With the help of compositions, you can build a self-service platform where developers can easily create resources on-demand when they need them without having to jump through hoops to get additional backing services.

Throughout the series, I will be using the terms XRD,XR, and XRS. Here is a quick overview of what they stand for.

XRD - Composite Resource Definition
XR - Composite Resource
XRC - Claim

Prerequisites:

Access to Google Cloud and gcloud cli
Kubernetes cluster eg. minikube
Helm
Access to Aiven

Aiven offers a free tier so you can create an account and use it to get through the tutorial.

Install Crossplane

helm repo add crossplane-stable https://charts.crossplane.io/stable

helm install crossplane --namespace crossplane-system --create-namespace crossplane-stable/crossplane

At this point, the Crossplane is ready, and the next step is to install the needed Providers that give the Crossplane capabilities to provision managed resources to external systems.

In this series, we will provision resources to Google Cloud using the Google Cloud providers and, later in the series, leverage Terraform provider to manage resources that don't have a Crossplane native provider available yet.

Links to each part of this series:

Crossplane: How do providers work

Prometheus Observability Platform: Grafana

Aleksi Waldén — Thu, 14 Sep 2023 10:28:01 +0000

Grafana is the industry standard open-source product for visualising metrics stored in a TSDB format, or a variety of other data sources. With Grafana, we can create dashboards, queries, and alerts from the data that we have. With all our metrics in long-term storage, we can use a single data source to access all the metrics from all our infrastructure that uses the metrics platform. This enables easily creating dashboards that aggregate data from multiple different Kubernetes clusters, and enable drilling down to a single resource easily.

Demo

Next, we will set up a Grafana instance into our minikube and use Promxy as the default data source. This example assumes that you have completed the following steps, as the components from those are needed:

Prerequisites:

base64

First we start with adding the Grafana Helm chart repository, and installing its contents into the Grafana namespace:

helm repo add grafana https://grafana.github.io/helm-charts

Next, we define Promxy as the data source. In the Helm values file, we need the following block to do this:

datasources.yaml:
  apiVersion: 1
  datasources:
  - name: Promxy
    type: prometheus
    url: "http://promxy.promxy.svc.cluster.local:8082"
    isDefault: true

We are using the svc.cluster.local address for the Promxy service, because all our services are inside the cluster.

I have converted the above into json so that it can be passed to Helm:

helm install grafana grafana/grafana --create-namespace --namespace grafana --set-json 'datasources={"datasources.yaml":{"apiVersion":1,"datasources":[{"name":"Promxy","type":"prometheus","url":"http://promxy.promxy.svc.cluster.local:8082","isDefault":true}]}}'

Next we need to get the admin password for the admin user:

kubectl get secret --namespace grafana grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Now we can port-forward the Grafana service:

kubectl port-forward -n grafana services/grafana 9090:80

Navigate to http://localhost:9090 to access the web UI, and log in with the username admin, and the password acquired in the previous step. From here you can verify that Promxy is set up and acting as the default data source by navigating to Administration -> Data sources -> Promxy, and clicking the Test button at the bottom of the page.

Assuming the test was successful, we can then navigate to the “Explore” item in the menu

and check that we have metrics available in the “metrics explorer” section. Alternatively, we can use the following query to check that metrics are available:

sum(kube_pod_container_status_restarts_total) by (namespace, container)

N.B. you might have to change the time range for the query to get results:

We have now achieved setting up Grafana as the metrics visualisation tool for our metrics platform. This enables us to create dashboards and Grafana alerts for metrics from all sources sending metrics to our long-term storage cluster (or clusters if we have multiple regions) that are queried using Promxy.

Prometheus Observability Platform: Application metrics

Aleksi Waldén — Thu, 14 Sep 2023 10:27:31 +0000

When creating our own applications, we need to use a metrics library to generate the metrics and then inside our application functions increment said metrics. With Go for example, we can use the Prometheus library. The metrics will then be exposed to the /metrics endpoint. If our application is inside a Kubernetes cluster with a prometheus-operator, we can use a ServiceMonitor to scrape its metrics. If we don’t have such a possibility, we can instead set up an application to send metrics straight to our long-term storage solution. For VictoriaMetrics, we can use the github.com/VictoriaMetrics/metrics library to send the metrics to VictoriaMetrics. Remember to add authentication logic into the section pushing the metrics to the long-term storage, if necessary.

Demo

This example assumes that you have completed the following steps as the components from those are needed:

Prometheus Observability Platform: Prometheus

Let's set up a hello-world Golang application in our cluster, and use ServiceMonitor to send its metrics to Prometheus.

First, we need to update our kube-prometheus-stack Helm deployment to pick up ServiceMonitor resources with a certain label attached. We need to pass the following value to our Helm chart:

prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchExpressions:
      - key: app
        operator: Exists

I have converted that into json so that it can be passed to Helm:

helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace prometheus --reuse-values --set-json 'prometheus.prometheusSpec.serviceMonitorSelector={"matchExpressions":[{"key":"app","operator":"Exists"}]}'

This will update our kube-prometheus-stack to pick up ServiceMonitor resources from any namespace, as long as they have an app label attached.

Next, we are going to create a namespace for our hello-world application which is a simple Golang application exposing metrics via the Prometheus module. We will borrow this already-made application, which has logic defined to increment a metric called hello_processed_total each time the page is loaded.

To create a namespace and a pod, we use the following commands:

kubectl create namespace hello-world

kubectl run hello-world --namespace=hello-world --image='okteto/hello-world:golang-metrics' --labels app=hello-world

Next, we need to create a service for the new pod:

cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  labels:
    app: hello-world
  name: hello-world
  namespace: hello-world
spec:
  ports:
  - name: http
    port: 8080
  selector:
    app: hello-world
  type: ClusterIP
EOF

Now we can test that our application is working by port-forwarding it. We can also check what the hello_processed_total metric looks like:

kubectl port-forward -n hello-world services/hello-world 9090:8080

Now navigate to http://localhost:9090 and http://localhost:9090/metrics. You should see a metric called hello_processed_total with a number attached. Each reload of the page will increment this number.

Next, we need to set up a ServiceMonitor to send these metrics to Prometheus:

cat <<'EOF' | kubectl create -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hello-world
  namespace: hello-world
  labels:
    app: hello-world
spec:
  selector:
    matchLabels:
      app: hello-world
  endpoints:
    - port: http
EOF

This ServiceMonitor will target services matching the label selector (app=hello-world) and will scrape the port called “http”.

Now, if we port-forward our Prometheus service, we should see a new service in the service discovery section:

kubectl port-forward -n prometheus services/kube-prometheus-stack-prometheus 9090:9090

Navigate to http://localhost:9090/service-discovery and you should see that there is a new service discovered with the name serviceMonitor/hello-world/hello-world/0 and it should show 1/1 active targets.

We can now query the hello_processed_total metric:

We have now achieved sending metrics from our custom app running in its own namespace into Prometheus.

Next part: Prometheus Observability Platform: Grafana

Prometheus Observability Platform: Handling multiple regions

Aleksi Waldén — Thu, 14 Sep 2023 10:27:13 +0000

When we have multiple regions such as the EU and US, we need to have a long-term storage solution running in both of those. If we want to combine the resources into a single query we need to use a query layer that can query both endpoints. One such component we can use is Promxy.

Promxy uses the same PromQL syntax as Prometheus and we can define server groups with multiple endpoints. In our case, we would define our EU and US long-term storage endpoints under one server group. We can then use the single Promxy endpoint to query both the EU and the US.

Demo

This example assumes that you have completed the following steps, as the components from those are needed:

We can use the Helm chart offered in the Promxy repository to deploy a proxy to our Kubernetes cluster.

First we clone the repository, because the Helm chart is not published to a public registry:

git clone https://github.com/jacksontj/promxy.git

Then we navigate to the folder containing the Helm chart:

cd promxy/deploy/k8s/helm-charts/promxy/

We set up Promxy with the following server_groups:

server_groups:
  - static_configs:
    - targets:
      - vmcluster-victoria-metrics-cluster-vmselect.victoriametrics.svc.cluster.local:8481
      labels:
        region: eu
    scheme: http
    path_prefix: /select/0/prometheus

I have converted this to json so we can pass it to the helm chart:

{"server_groups":[{"static_configs":[{"targets":["vmcluster-victoria-metrics-cluster-vmselect.victoriametrics.svc.cluster.local:8481"],"labels":{"region":"eu"}}],"scheme":"http","path_prefix":"/select/0/prometheus"}]}

To install Promxy from the local Helm chart we use the following:

helm install promxy . --create-namespace --namespace promxy --set 'image.tag=latest' --set-json 'config.promxy={"server_groups":[{"static_configs":[{"targets":["vmcluster-victoria-metrics-cluster-vmselect.victoriametrics.svc.cluster.local:8481"],"labels":{"region":"eu"}}],"scheme":"http","path_prefix":"/select/0/prometheus"}]}'

We can now port-forward the Promxy service and access the web UI from http://localhost:9090:

kubectl port-forward -n promxy services/promxy 9090:8082

Here we can perform the same query for the kube_pod_container_status_restarts_total to verify that Promxy is able to reach the VictoriaMetrics data:

If we have more regions than just the US, we can add them under the server_groups and query multiple VictoriaMetrics instances from a single Promxy source.

We have now achieved setting up Promxy with server_groups to use for querying VictoriaMetrics instances.

Next part: Prometheus Observability Platform: Application metrics

Prometheus Observability Platform: Alert routing

Aleksi Waldén — Thu, 14 Sep 2023 10:26:37 +0000

Alertmanager is a component usually bundled with Prometheus to handle routing the alerts to receivers such as Slack, e-mail, and PagerDuty. It uses a routing tree to send alerts to one or multiple receivers.

Routes define which receivers each alert should be sent to. You can define rules for the routes. The rules are evaluated from top to bottom, and alerts are sent to matching receivers. Usually, the match block is used to match the label name and value for a certain receiver. Notification integrations are configured for each receiver. There are multiple different options available, such as email_configs, slack_configs, and webhook_configs.

Alertmanager has a web UI that can be used to view current alerts and silence them if needed.

With a platform setup, we usually don’t want to use multiple Alertmanagers, so we disable the provisioning of additional alertmanagers for Prometheus deployments that include them automatically. Instead, we use one centralised Alertmanager inside, for example, a Kubernetes cluster which is aimed at monitoring platform usage.

Demo

This example assumes that you have completed the following steps, as the components from those are needed:

Prerequisites:

Amtool (https://github.com/prometheus/alertmanager#install-1)

Now that we have an alert defined and deployed to vmalert we can add Alertmanager to our platform. Because we are creating this with a platform aspect in mind, we will install Alertmanager as a separate resource, and not as a part of the kube-platform-stack. We will use a tool called amtool which is bundled with alertmanager to run unit tests on our alert rules

We can install the Alertmanager with the following Helm chart:

helm install alertmanager prometheus-community/alertmanager --create-namespace --namespace alertmanager

We can now port-forward the alertmanager service to access the alertmanager web UI from http://localhost:9090:

kubectl port-forward -n alertmanager services/alertmanager 9090:9093

To trigger a test alert, we can use the following command from another terminal tab while keeping the port-forwarding on:

curl -H "Content-Type: application/json" -d '[{"labels":{"alertname":"TestAlert"}}]' localhost:9090/api/v1/alerts

We can now use amtool to list the currently firing alerts:

amtool alert query --alertmanager.url=http://localhost:9090
---
Alertname   Starts At                Summary  State   
TestAlert   2023-07-07 07:23:55 UTC           active

Let's add a test receiver and routing for it. Below is an example of the configuration we want to pass to Alertmanager in Helm values format.

config:
  receivers:
    - name: default-receiver
    - name: test-team-receiver

  route:
    receiver: 'default-receiver'
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: 'test-team-receiver'
        matchers:
        - team="test-team"

I have converted the above into a json one-liner so we can pass it into Helm without having to create an intermediate file.

helm upgrade alertmanager prometheus-community/alertmanager --namespace alertmanager --set-json 'config.receivers=[{"name":"default-receiver"},{"name":"test-team-receiver"}]' --set-json 'config.route={"receiver":"default-receiver","group_wait":"30s","group_interval":"5m","repeat_interval":"4h","routes":[{"receiver":"test-team-receiver","matchers":["team=\"test-team\""]}]}'

We can now use amtool to test that an alert that has the label team=test-team gets routed to the test-team-receiver:

amtool config routes test --alertmanager.url=http://localhost:9090 team=test-team
---
test-team-receiver

amtool config routes test --alertmanager.url=http://localhost:9090 team=test     
---
default-receiver

We have now set up an Alertmanager which can route alerts depending on team label value.

Next, we need to update vmalert to route alerts into the Alertmanager using the cluster local address of the alertmanager service:

helm upgrade vmalert vm/victoria-metrics-alert --namespace victoriametrics --reuse-values --set server.notifier.alertmanager.url="http://alertmanager.alertmanager.svc.cluster.local:9093"

Now we can run a pod that will be crashing to increment the kube_pod_container_status_restarts_total metric by creating a pod that has a typo in the sleep command:

kubectl run crashpod --image busybox:latest --command -- slep 1d

Next we port-forward the alertmanager service. We should see an alert in there when we navigate to http://localhost:9090:

kubectl port-forward -n alertmanager services/alertmanager 9090:9093

We have now achieved setting up Alertmanager as our tool for routing alerts from the vmalert component.

Next part: Prometheus Observability Platform: Handling multiple regions