Prasad P
How We Built an AI Terraform Co-Pilot That Actually Works (And Made It Free)

The Problem We Kept Hitting

Every DevOps engineer has been here: you need to spin up infrastructure, but Terraform syntax is fighting you. You know what you want—"an RDS instance with read replicas in us-east-1"—but translating that to HCL takes 30 minutes of documentation diving.

Existing AI tools? They hallucinate provider versions. They forget required arguments. They generate code that looks right but fails on terraform plan.

We spent 18 months building something better for Realm9, and I want to share the technical approach that made it actually useful.


Why Most AI-to-Terraform Tools Fail

Before diving into our solution, here's why the naive approach doesn't work:

1. Context Window Limitations

Terraform configurations reference modules, variables, and state from across your project. GPT-4 can't see your entire codebase.

2. Version Drift

The AI was trained on Terraform 0.12 syntax, but you're running 1.6. Provider APIs change constantly.

3. State Blindness

The AI doesn't know what resources already exist. It'll suggest creating a VPC when you already have three.

4. No Validation Loop

Most tools generate code and hope for the best. No terraform validate, no plan check, no iteration.


Our Architecture: How We Solved It

Here's the technical breakdown of how Realm9's Terraform Co-Pilot actually works:

Layer 1: Project Context Injection

Before any prompt hits the LLM, we build a context package:

├── Current provider versions (from .terraform.lock.hcl)
├── Existing resource inventory (from state)
├── Variable definitions and current values
├── Module interfaces you've defined
└── Your naming conventions (parsed from existing code)

This context is injected into the system prompt, so the AI knows:

  • You use aws provider 5.31.0, not 4.x
  • You already have a VPC named main-vpc
  • Your naming convention is ${project}-${env}-${resource}
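The context-assembly step can be sketched in a few lines of Python. This is a simplified illustration, not Realm9's actual implementation; the function names, the regex, and the prompt wording are all hypothetical, assuming the lock file follows the standard .terraform.lock.hcl layout:

```python
import re

def parse_lock_versions(lock_hcl: str) -> dict:
    """Extract provider -> version pairs from .terraform.lock.hcl content."""
    versions = {}
    # Each provider block looks like:
    # provider "registry.terraform.io/hashicorp/aws" { version = "5.31.0" ... }
    for name, version in re.findall(
        r'provider\s+"([^"]+)"\s*{[^}]*?version\s*=\s*"([^"]+)"', lock_hcl, re.S
    ):
        versions[name.split("/")[-1]] = version
    return versions

def build_context_prompt(lock_hcl: str, resources: list, naming: str) -> str:
    """Assemble a context package to prepend as the system prompt."""
    versions = parse_lock_versions(lock_hcl)
    lines = ["You are generating Terraform HCL for an existing project."]
    lines.append("Provider versions: " + ", ".join(f"{k} {v}" for k, v in versions.items()))
    lines.append("Existing resources: " + ", ".join(resources))
    lines.append("Naming convention: " + naming)
    return "\n".join(lines)
```

The key design point is that the lock file, not the model's training data, is the source of truth for provider versions.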

Layer 2: Retrieval-Augmented Generation (RAG)

We maintain a vector database of:

  • Official Terraform provider documentation
  • AWS/Azure/GCP API specifications
  • Common patterns and anti-patterns

When you ask "create an S3 bucket with versioning", we retrieve the current S3 resource documentation—not whatever was in GPT's training data 18 months ago.
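Conceptually, retrieval looks like the sketch below. A real system would use embeddings and a vector database; this stand-in ranks documents by plain word overlap just to show the shape of the lookup (all names are hypothetical):

```python
def retrieve(query: str, docs: dict, k: int = 2) -> list:
    """Rank doc names by word overlap with the query (stand-in for vector search)."""
    q = set(query.lower().split())
    scored = sorted(
        docs.items(),
        key=lambda kv: -len(q & set(kv[1].lower().split())),
    )
    return [name for name, _ in scored[:k]]
```

The retrieved provider-doc snippets are then appended to the prompt, so the model writes against the current resource schema rather than a stale memory of it.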

Layer 3: Validation Loop

Here's where most tools stop. We don't.

User prompt
    ↓
Generate HCL
    ↓
terraform fmt (syntax check)
    ↓
terraform validate (semantic check)
    ↓
If errors → feed errors back to LLM → regenerate
    ↓
terraform plan (dry run)
    ↓
Show plan diff to user

The AI sees its own mistakes and fixes them. It usually takes one or two iterations to reach valid code.
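The loop above can be expressed as a small driver function. This is a minimal sketch, not Realm9's code: the generate and validate callables are assumed to wrap the LLM call and a terraform fmt/validate run, and the error-feedback prompt wording is made up:

```python
def generate_with_validation(prompt, generate, validate, max_iters=3):
    """Generate HCL, validate it, and feed errors back until valid or out of tries.

    generate: callable(prompt) -> HCL string (wraps the LLM)
    validate: callable(hcl) -> error string, empty if valid
              (would wrap `terraform fmt` + `terraform validate`)
    Returns (hcl, is_valid).
    """
    hcl = generate(prompt)
    for _ in range(max_iters):
        errors = validate(hcl)
        if not errors:
            return hcl, True
        # Feed the validator's errors back so the model can correct itself
        hcl = generate(
            prompt
            + "\n\nYour previous attempt failed validation with:\n" + errors
            + "\n\nPrevious code:\n" + hcl
        )
    return hcl, False
```

Only after this loop returns valid code does terraform plan run, and the plan diff is what the user reviews before anything is applied.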

Layer 4: BYOK (Bring Your Own Key)

We don't lock you into our API costs. On the free tier, you plug in your own OpenAI/Anthropic/Azure OpenAI key. You control:

  • Which model (GPT-4, Claude, etc.)
  • Rate limits
  • Cost

Paid tiers include API credits so you don't have to manage keys.
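A BYOK setup typically resolves to whichever key the user has supplied. The sketch below is illustrative only; the environment-variable names and the precedence order are assumptions, not Realm9's documented behavior:

```python
import os

def resolve_llm_config() -> dict:
    """Pick an LLM backend from whichever user-supplied key is present (BYOK)."""
    # Hypothetical env-var names and precedence; the real product may differ.
    for env_var, provider in [
        ("OPENAI_API_KEY", "openai"),
        ("ANTHROPIC_API_KEY", "anthropic"),
        ("AZURE_OPENAI_API_KEY", "azure-openai"),
    ]:
        key = os.environ.get(env_var)
        if key:
            return {"provider": provider, "api_key": key}
    raise RuntimeError("No LLM API key found; supply one to use the co-pilot")
```

Because the key never leaves the user's control, the platform has no per-token margin to protect, which is part of why the AI can sit in the free tier.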


Real Example: What This Looks Like

User input:

Create an RDS PostgreSQL instance for production with:
- Multi-AZ deployment
- 100GB storage with autoscaling
- Private subnet only
- 7-day backup retention

What the AI generates (after validation loop):

resource "aws_db_instance" "production_postgres" {
  identifier     = "${var.project}-${var.environment}-postgres"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.r6g.large"

  allocated_storage     = 100
  max_allocated_storage = 500
  storage_type          = "gp3"

  multi_az               = true
  db_subnet_group_name   = aws_db_subnet_group.private.name
  vpc_security_group_ids = [aws_security_group.rds.id]

  backup_retention_period = 7
  backup_window          = "03:00-04:00"
  maintenance_window     = "Mon:04:00-Mon:05:00"

  skip_final_snapshot = false
  final_snapshot_identifier = "${var.project}-${var.environment}-postgres-final"

  tags = local.common_tags
}

Notice it:

  • Used your existing naming convention (var.project, var.environment)
  • Referenced your existing subnet group and security group
  • Picked an appropriate engine version for the current provider
  • Added sensible defaults you didn't specify (maintenance window, final snapshot)

Why We Made the AI Free

The free tier includes:

  • 5 users
  • 10 environments
  • 1 Terraform project with 3 workspaces
  • Full AI co-pilot with BYOK

Why give away the AI? Because:

  1. AI is table stakes now - Charging for basic AI features feels wrong in 2025
  2. BYOK means no margin anyway - You're paying OpenAI directly
  3. The value is the complete platform - AI alone isn't useful; AI integrated with full Terraform lifecycle management is

Our paid tiers ($9.2k-$48k/year) are for teams that need more capacity, enterprise security (SSO/SAML), and included API credits.


Beyond AI: Complete Terraform Lifecycle Management

The AI co-pilot is just one part. Realm9 provides end-to-end Terraform lifecycle management:

Projects & Workspaces

  • Organize infrastructure into projects with multiple workspaces (dev, staging, prod)
  • GitOps integration with GitHub/GitLab for version control
  • Automatic plan/apply workflows with approval gates

Enterprise-Grade Security

  • End-to-end encryption for all credentials and secrets
  • Cloud provider credentials stored with AES-256 encryption
  • No plaintext secrets ever touch disk

Compliance & Audit Trail

  • SOC 2 Type II compliant controls
  • ISO 27001 security framework
  • Complete audit logging of every action
  • Who ran what, when, and what changed
  • Exportable audit reports for compliance reviews

State Management

  • Secure remote state storage
  • State locking to prevent conflicts
  • State versioning and rollback capabilities
  • Drift detection between state and actual infrastructure

This isn't just an AI wrapper—it's a complete Terraform platform that happens to have AI built in.


The Bigger Picture: Environment Management

The AI co-pilot is part of Realm9, a platform that also handles:

  • Environment booking - No more spreadsheets or Slack wars over who's using staging
  • Built-in observability - Logs/metrics/traces at 1/10th the cost of Datadog
  • Drift detection - Know when infrastructure doesn't match code

We built it because we were spending $150k+/year on Plutora + Terraform Cloud + Datadog, and they didn't even talk to each other.


Try It Yourself

Option 1: Self-host free tier

  • Installation guide - Deploy on your Kubernetes cluster in 30 minutes
  • Bring your own LLM API key
  • Full AI co-pilot included

Option 2: Evaluate enterprise features

  • 14-day evaluation - Test Terraform automation, SSO/SAML, advanced AI
  • No credit card required

Option 3: Explore the code


What's Next

We're working on:

  • Multi-cloud support - Same AI, different providers (Azure, GCP)
  • Cost estimation - "This change will add ~$45/month"
  • Policy as Code - AI suggests compliant configurations

Follow our GitHub or check realm9.app for updates.


Questions? Drop them in the comments. I'll answer everything about the architecture, AI approach, or why we made certain decisions.
