Atul Vishwakarma

Posted on Jun 29

TerraGuard AI: Adding an AI Brain to Terraform Drift Detection — Design Decisions, Trade-offs, and Lessons Learned

#ai #aws #infrastructure #devops

Repo: https://github.com/vatul16/TerraGuard-AI — full source, Terraform infra, drift detection engine, and CI/CD workflows.

When I finished TerraTier, the 3-tier AWS architecture project I wrote about last time, I wanted to build something that pushed in a different direction. TerraTier was about the infrastructure layer — networking, security boundaries, secrets management. This project is about the operational layer: what happens after the infrastructure is running, when someone inevitably makes a change outside of Terraform.

The result is TerraGuard AI — an event-driven drift detection system that catches manual AWS console changes, classifies their risk using an LLM, and routes context-rich alerts intelligently instead of just firing a generic notification every time a description field changes. A Node.js API on ECS Fargate, a PostgreSQL database on RDS, and an ALB are the infrastructure being monitored. GitHub Actions, Python, and Groq's API for Llama 3.3 are how the detection and analysis work.

This article is about the decisions behind why it works the way it does, and what I'd change.

The problem with standard drift detection

The textbook implementation of Terraform drift detection is: run terraform plan -detailed-exitcode on a cron job, check the exit code, send an alert if it's non-zero. This works, and it takes maybe 30 lines of shell script to implement. The problem isn't the detection — it's what happens to the humans on the other end of the alert.

In any environment where more than one person touches the AWS console, drift is nearly constant. Someone tags a security group directly in the console, or updates a description, or changes a retention period on a log group that wasn't in Terraform yet. Each of these fires the exact same alert as "port 22 opened to 0.0.0.0/0 on a production security group." After the third or fourth time the on-call engineer wakes up to an alert and finds a tag change, they start doing what humans do with noisy signals: they ignore them.

Alert fatigue is a well-documented problem in SRE literature, but most drift detection tooling solves it by giving you more configuration options — exception lists, ignore patterns, threshold tuning. That's fighting symptoms. The actual cause is that the alerting system has no concept of severity. It knows something changed; it doesn't know what that means.

The insight that shaped this project is that detection and classification are two separate problems, and they need different tools. Terraform is genuinely excellent at detection — it has the full resource schema, it talks directly to the AWS APIs, and its plan output is deterministic and reliable. It has no idea what a change means from a security or business perspective. That's a reasoning problem, and it's exactly what language models are good at.

So TerraGuard AI uses Terraform to detect drift, and an LLM to classify it. The two tools do the parts they're each actually suited for.

Decision 1: AI for classification, not detection

This distinction is worth dwelling on, because it's easy to imagine AI being used to find drift — some vector-similarity approach comparing current state to desired state, or something. That would be using AI to badly replicate something Terraform already does perfectly.

The LLM layer in this project only ever sees the parsed output of terraform plan. Drift is already confirmed by the time the AI gets involved. What the AI does is answer a different set of questions: Is this change a security risk or a configuration nuance? Should it be reverted or adopted into the Terraform codebase? What is the exact command someone should run to remediate it, right now, without having to look anything up?

The prompt asks Groq's Llama 3.3 to act as a Senior Cloud Security Engineer and return a structured JSON response with a risk level (CRITICAL / HIGH / MEDIUM / LOW / INFO), a score out of 10, an impact summary, a recommended action (REVERT / ADOPT / INVESTIGATE / MONITOR), and a concrete remediation command. The temperature is set to 0.1 — not 0, because zero-temperature responses can be oddly mechanical, but close enough to zero that the analysis is consistent and factual rather than creative.

{
  "risk_level": "CRITICAL",
  "risk_score": 9,
  "category": "Security",
  "summary": "Security group ingress rule added exposing SSH port 22 to all internet traffic.",
  "impact": "Any host on the internet can attempt SSH brute-force or exploitation against ECS tasks.",
  "action": "REVERT",
  "remediation": "terraform apply -target=aws_security_group.ecs_tasks",
  "reasoning": "Opening port 22 to 0.0.0.0/0 is a critical security misconfiguration..."
}

Based on that response, the routing is simple: CRITICAL or HIGH goes to Slack immediately, plus a GitHub Issue for the audit trail. MEDIUM, LOW, or INFO creates a GitHub Issue that goes into the backlog — no Slack notification, no on-call page.

The practical effect is that the Slack channel only gets messages that actually need a human to look at something now. The rest creates a self-organizing backlog of minor drift that can be handled during normal working hours. That's not a feature — it's the entire point.

Decision 2: Groq over AWS Bedrock

The obvious choice for an AWS-native project would be Amazon Bedrock. The LLM call would live inside the VPC, authentication would use the same IAM role the rest of the infrastructure uses, and there would be no external API dependency.

I chose Groq instead, for two reasons that turned out to matter more than I expected.

The first is speed. Groq's custom LPU hardware runs Llama 3.3 70B at a genuinely unusual pace — we're talking single-digit seconds for a complete risk analysis. Bedrock inference at a comparable model size is slower, and when drift detection runs in a GitHub Actions workflow with a 6-hour cron cadence, latency matters less than it would in a real-time system. But for local development and testing, where you're running the detector manually and waiting for output, the speed difference makes iteration significantly less frustrating.

The second reason is cost. Groq's free tier is 14,400 requests per day, which is more than enough for a drift detection workflow that runs every 6 hours and only calls the API when drift is actually found. Bedrock charges per-token with no free tier. For a personal portfolio project that also needs to pay for ECS, RDS, NAT Gateway, and an ALB, not paying for inference is a meaningful constraint.

The trade-off is the external dependency. The drift detector now requires an outbound HTTPS connection from the GitHub Actions runner to api.groq.com, and if Groq is down or rate-limiting, the analysis step fails. I handled this with a fallback in the Python script: if the Groq call fails or returns unparseable JSON, the detector falls back to a MEDIUM risk assessment with a manual review recommendation rather than crashing entirely. The drift is still logged and acted on; it just lacks the AI analysis.

Decision 3: GitHub Actions over Lambda

The architecturally "purer" approach to scheduled drift detection is a Lambda function triggered by EventBridge on a cron schedule, with an additional EventBridge rule listening to CloudTrail events for real-time detection when a console change happens. That's closer to the enterprise architecture in the project's original design.

I built it on GitHub Actions instead, and I think that was the right call for this stage.

The practical argument: GitHub Actions is free, already configured for the project, and doesn't require provisioning, IAM-scoping, or maintaining a separate compute resource. A Lambda function adds at least three more Terraform resources to manage, an ECR image or deployment package to maintain, and CloudWatch Logs to check when something goes wrong — all of which is real work that doesn't directly demonstrate anything new. The drift detection workflow in GitHub Actions does the same job with less infrastructure surface area and a simpler debugging story (Actions tab in GitHub, logs right there).

The conceptual argument: the drift detector is not a latency-sensitive workload. A 6-hour detection window is fine for a portfolio project, and even for many real production environments, catching a security group change within 6 hours is a meaningful improvement over catching it when someone manually reviews the AWS console. Real-time detection via CloudTrail + EventBridge is in the project roadmap, but it's a layer added on top of a working baseline, not a prerequisite.

Where this decision has a genuine cost: the GitHub Actions runner needs AWS credentials. That means AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as repository secrets — long-lived credentials stored in GitHub, which is a less secure pattern than OIDC role assumption. The Terraform IAM code in the project already scaffolds the OIDC setup (aws_iam_openid_connect_provider, aws_iam_role with a GitHub OIDC trust policy), and migrating to keyless authentication is the highest-priority item on the roadmap. I built with access keys first to get the workflow working without debugging OIDC trust policies at the same time, which is a trade-off I'd make again in the same situation.

Decision 4: Minimal infrastructure being monitored

A reasonable design question for a drift detector is: what infrastructure should it monitor? You could make the case that the more complex the infrastructure, the more impressive the demo — VPCs with 20 resources, multiple environments, Auto Scaling Groups, the works.

I went the opposite direction. The monitored infrastructure is as simple as it can be while still being realistic: one ECS Fargate service, one RDS PostgreSQL database, one ALB. That's a pattern that appears in real production environments constantly — it's the architecture for a backend API behind a load balancer with persistent storage. Nothing exotic.

The reason for keeping it simple is that the interesting thing in this project is the drift detection and classification layer, not the infrastructure. A 40-resource VPC with multiple tiers and Auto Scaling Groups would take most of the explanation budget to describe, and would make the CI/CD and drift detection workflows harder to follow for someone reading the code. By keeping the infrastructure simple and well-understood, someone reading the repository can quickly get past "what does this infrastructure do" and get to "how does the drift detection actually work" — which is the part worth explaining.

Decision 5: Separating infrastructure and application deployment

The most practically important single line in the entire ecs.tf file is this:

lifecycle {
  ignore_changes = [task_definition]
}

Without this, terraform apply and the deploy-app.yml GitHub Actions workflow would fight each other. The deploy workflow updates the ECS task definition to use a new ECR image URI (tagged with the git SHA) every time new code is pushed to app/**. If Terraform's ECS service resource didn't ignore task definition changes, the next terraform apply — triggered by the drift detector or by a manual infra change — would overwrite the CI/CD-deployed image with whatever image URI is in the Terraform variables. Deploying a new version of the app would immediately cause "drift" against the Terraform-managed state.

The ignore_changes lifecycle setting tells Terraform: "I know this attribute might differ from what's in the config, and that's intentional — CI/CD owns it, not you." This is the standard pattern for separating infrastructure management (Terraform's job) from application deployment (CI/CD's job) on ECS, and it's one of those things that sounds obvious in retrospect but takes a debugging session to discover if you haven't seen it before.

The deploy workflow itself follows a pattern worth understanding: it doesn't just push an image and call aws ecs update-service --force-new-deployment. It downloads the current live task definition from AWS, replaces only the container image URI, registers the modified definition as a new revision, and then updates the service to use that revision. The reason for this sequence is that the task definition contains Secrets Manager ARNs, IAM role ARNs, environment variable sets, and health check configuration that are all managed by Terraform. Overwriting the entire task definition from a CI/CD workflow would blow away those fields. Downloading the live definition and patching only the image preserves everything else.

Decision 6: Secrets Manager with ECS secrets references, not environment variables

The Node.js application gets its database connection details through a combination of plain environment variables and a secrets block in the ECS task definition.

Non-sensitive values — DB_HOST, DB_PORT, DB_NAME, DB_USER, NODE_ENV, PORT — are passed as plain environment variables. They're not sensitive: knowing the RDS hostname doesn't help an attacker get into the database.

The password is different. The ECS secrets block references the Secrets Manager secret by ARN, and at container startup, the ECS agent fetches the value, injects it as DB_PASSWORD in the container's environment, and never writes it anywhere that shows up in the task definition JSON, CloudWatch logs, or the AWS console. The only evidence it existed is the ARN reference.

secrets = [
  {
    name      = "DB_PASSWORD"
    valueFrom = aws_secretsmanager_secret.db_password.arn
  }
]

This also required a specific IAM policy attached to the ECS execution role (not the task role — the agent that starts containers, not the container itself):

Action   = ["secretsmanager:GetSecretValue"]
Resource = aws_secretsmanager_secret.db_password.arn

The scoping matters. The policy grants access to exactly one secret ARN, not "Resource": "*". If an attacker found a way to escalate privileges from inside the container to the execution role, they could read one specific secret with a known ARN, not any secret in the account. That's a meaningful reduction in blast radius for a one-line change.

A race condition I hit and what it taught me

When I added the Node.js app port (3000) to the Terraform configuration — replacing the placeholder 80 that was there from the initial setup — running terraform apply produced an error I hadn't seen before:

Error: deleting ELBv2 Target Group: ResourceInUse: Target group is currently in use by a listener or a rule

What happened: changing the port on an ALB target group forces replacement (Terraform creates a new target group, then tries to delete the old one). But the ALB listener still referenced the old target group when Terraform tried to delete it. The listener update and the target group deletion were running in the wrong order — or more precisely, in parallel, and the deletion raced against the listener update and lost.

The fix was simply running terraform apply a second time. The first apply created the new target group and partially updated the listener, but the listener change hadn't completed by the time the old target group deletion ran. The second apply found the listener already pointing to the new target group and successfully deleted the old one.

I mention this because it's the kind of error that's alarming-looking and turns out to be completely benign — and knowing that pattern ("Terraform apply failed with ResourceInUse, run it again") is one of those things that takes a long time to learn from documentation and about five seconds to learn from experience. If you hit it: apply again.

The deeper lesson is about Terraform's dependency graph. Terraform builds an explicit dependency graph from the depends_on and implicit references in your configuration, and parallelizes everything it can within that graph. When you're modifying a resource that has something depending on it and you're replacing the resource rather than updating it in-place, the order of operations for the replacement can be non-obvious. Looking at the terraform plan output more carefully — specifically the -/+ indicators for "destroy and create replacement" — before applying would have flagged this in advance.

What I'd build next

The most significant gap in the current design is real-time detection. Running terraform plan every 6 hours means there's a window of up to 6 hours between a manual console change and the drift detection alert. For most configuration changes that's probably fine. For a security group rule opening port 22 to the internet, it's not.

The architecture for closing that gap exists: an EventBridge rule that listens to CloudTrail events for specific API calls (AuthorizeSecurityGroupIngress, ModifyDBInstance, AttachRolePolicy, and similar high-risk mutations), triggering a Lambda function that runs the plan-and-classify pipeline on demand, within seconds of the change. That's the enterprise-grade pattern that the project's original design was based on, and it's what I'd add in v2.

OIDC authentication for GitHub Actions is the other immediate priority. Storing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as repository secrets is functional but carries unnecessary risk — those credentials persist until they're rotated, and rotating them requires remembering they exist. The OIDC setup in iam.tf is already scaffolded; it's a workflow change and a secret deletion away from being the default.

A drift history store is something I've sketched but not built. Right now, every drift detection run is stateless — the result goes to Slack or GitHub Issues, and that's the end of it. If you wanted to answer "how often does our infrastructure drift, in which direction, and is it getting better or worse over time?", there's nowhere to look. A DynamoDB table storing detection results per run, with a simple dashboard, would turn TerraGuard AI from a reactive alerting tool into something with genuine observability over infrastructure compliance trends.

Closing thoughts

The most useful framing I found for this project is that it's not an AI project with some infrastructure bolted on, and it's not an infrastructure project with some Python bolted on. It's an operations automation project that happens to use AI for the part where you'd otherwise need a human to make a judgment call.

That framing matters because it determines what you actually have to understand to build it well. The Groq integration is a dozen lines of Python — the interesting engineering is in understanding when to call it, with what inputs, and what to do with the output, which requires thinking about the infrastructure, the CI/CD pipeline, the IAM model, and the alerting routing as a system. None of those are complicated individually. The interesting work is making them compose.

The full source is at https://github.com/vatul16/terraguard-ai — Terraform, Python detector, GitHub Actions workflows, and the Node.js app. I'm actively looking for Cloud/DevOps Engineer roles as I transition from full-stack development, and I'd genuinely enjoy talking through any part of this with someone who has questions about a specific decision. You can find me on LinkedIn.

Previous project: TerraTier — Production-grade 3-tier AWS architecture with Terraform, Auto Scaling Groups, two ALBs, four subnet tiers, and SSM Session Manager.

DEV Community