Prasad P
How We Built an AI Terraform Co-Pilot That Actually Works (And Made It Free)

The Problem We Kept Hitting

Every DevOps engineer has been here: you need to spin up infrastructure, but Terraform syntax is fighting you. You know what you want—"an RDS instance with read replicas in us-east-1"—but translating that to HCL takes 30 minutes of documentation diving.

Existing AI tools? They hallucinate provider versions. They forget required arguments. They generate code that looks right but fails on terraform plan.

We spent 18 months building something better for Realm9, and I want to share the technical approach that made it actually useful.


Why Most AI-to-Terraform Tools Fail

Before diving into our solution, here's why the naive approach doesn't work:

1. Context Window Limitations

Terraform configurations reference modules, variables, and state from across your project. GPT-4 can't see your entire codebase.

2. Version Drift

The AI was trained on Terraform 0.12 syntax, but you're running 1.6. Provider APIs change constantly.

3. State Blindness

The AI doesn't know what resources already exist. It'll suggest creating a VPC when you already have three.

4. No Validation Loop

Most tools generate code and hope for the best. No terraform validate, no plan check, no iteration.


Our Architecture: How We Solved It

Here's the technical breakdown of how Realm9's Terraform Co-Pilot actually works:

Layer 1: Project Context Injection

Before any prompt hits the LLM, we build a context package:

├── Current provider versions (from .terraform.lock.hcl)
├── Existing resource inventory (from state)
├── Variable definitions and current values
├── Module interfaces you've defined
└── Your naming conventions (parsed from existing code)

This context is injected into the system prompt, so the AI knows:

  • You use aws provider 5.31.0, not 4.x
  • You already have a VPC named main-vpc
  • Your naming convention is ${project}-${env}-${resource}
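The context-assembly step can be sketched in a few lines of Python. This is a simplified illustration, not Realm9's actual implementation; the function names, the regex, and the prompt wording are all hypothetical, assuming the lock file follows the standard .terraform.lock.hcl layout:

```python
import re

def parse_lock_versions(lock_hcl: str) -> dict:
    """Extract provider -> version pairs from .terraform.lock.hcl content."""
    versions = {}
    # Each provider block looks like:
    # provider "registry.terraform.io/hashicorp/aws" { version = "5.31.0" ... }
    for name, version in re.findall(
        r'provider\s+"([^"]+)"\s*{[^}]*?version\s*=\s*"([^"]+)"', lock_hcl, re.S
    ):
        versions[name.split("/")[-1]] = version
    return versions

def build_context_prompt(lock_hcl: str, resources: list, naming: str) -> str:
    """Assemble a context package to prepend as the system prompt."""
    versions = parse_lock_versions(lock_hcl)
    lines = ["You are generating Terraform HCL for an existing project."]
    lines.append("Provider versions: " + ", ".join(f"{k} {v}" for k, v in versions.items()))
    lines.append("Existing resources: " + ", ".join(resources))
    lines.append("Naming convention: " + naming)
    return "\n".join(lines)
```

The key design point is that the lock file, not the model's training data, is the source of truth for provider versions.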

Layer 2: Retrieval-Augmented Generation (RAG)

We maintain a vector database of:

  • Official Terraform provider documentation
  • AWS/Azure/GCP API specifications
  • Common patterns and anti-patterns

When you ask "create an S3 bucket with versioning", we retrieve the current S3 resource documentation—not whatever was in GPT's training data 18 months ago.
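Conceptually, retrieval looks like the sketch below. A real system would use embeddings and a vector database; this stand-in ranks documents by plain word overlap just to show the shape of the lookup (all names are hypothetical):

```python
def retrieve(query: str, docs: dict, k: int = 2) -> list:
    """Rank doc names by word overlap with the query (stand-in for vector search)."""
    q = set(query.lower().split())
    scored = sorted(
        docs.items(),
        key=lambda kv: -len(q & set(kv[1].lower().split())),
    )
    return [name for name, _ in scored[:k]]
```

The retrieved provider-doc snippets are then appended to the prompt, so the model writes against the current resource schema rather than a stale memory of it.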

Layer 3: Validation Loop

Here's where most tools stop. We don't.

User prompt
    ↓
Generate HCL
    ↓
terraform fmt (syntax check)
    ↓
terraform validate (semantic check)
    ↓
If errors → feed errors back to LLM → regenerate
    ↓
terraform plan (dry run)
    ↓
Show plan diff to user

The AI sees its own mistakes and fixes them. It usually takes one or two iterations to reach valid code.
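The loop above can be expressed as a small driver function. This is a minimal sketch, not Realm9's code: the generate and validate callables are assumed to wrap the LLM call and a terraform fmt/validate run, and the error-feedback prompt wording is made up:

```python
def generate_with_validation(prompt, generate, validate, max_iters=3):
    """Generate HCL, validate it, and feed errors back until valid or out of tries.

    generate: callable(prompt) -> HCL string (wraps the LLM)
    validate: callable(hcl) -> error string, empty if valid
              (would wrap `terraform fmt` + `terraform validate`)
    Returns (hcl, is_valid).
    """
    hcl = generate(prompt)
    for _ in range(max_iters):
        errors = validate(hcl)
        if not errors:
            return hcl, True
        # Feed the validator's errors back so the model can correct itself
        hcl = generate(
            prompt
            + "\n\nYour previous attempt failed validation with:\n" + errors
            + "\n\nPrevious code:\n" + hcl
        )
    return hcl, False
```

Only after this loop returns valid code does terraform plan run, and the plan diff is what the user reviews before anything is applied.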

Layer 4: BYOK (Bring Your Own Key)

We don't lock you into our API costs. On the free tier, you plug in your own OpenAI/Anthropic/Azure OpenAI key. You control:

  • Which model (GPT-4, Claude, etc.)
  • Rate limits
  • Cost

Paid tiers include API credits so you don't have to manage keys.
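A BYOK setup typically resolves to whichever key the user has supplied. The sketch below is illustrative only; the environment-variable names and the precedence order are assumptions, not Realm9's documented behavior:

```python
import os

def resolve_llm_config() -> dict:
    """Pick an LLM backend from whichever user-supplied key is present (BYOK)."""
    # Hypothetical env-var names and precedence; the real product may differ.
    for env_var, provider in [
        ("OPENAI_API_KEY", "openai"),
        ("ANTHROPIC_API_KEY", "anthropic"),
        ("AZURE_OPENAI_API_KEY", "azure-openai"),
    ]:
        key = os.environ.get(env_var)
        if key:
            return {"provider": provider, "api_key": key}
    raise RuntimeError("No LLM API key found; supply one to use the co-pilot")
```

Because the key never leaves the user's control, the platform has no per-token margin to protect, which is part of why the AI can sit in the free tier.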


Real Example: What This Looks Like

User input:

Create an RDS PostgreSQL instance for production with:
- Multi-AZ deployment
- 100GB storage with autoscaling
- Private subnet only
- 7-day backup retention

What the AI generates (after validation loop):

resource "aws_db_instance" "production_postgres" {
  identifier     = "${var.project}-${var.environment}-postgres"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.r6g.large"

  allocated_storage     = 100
  max_allocated_storage = 500
  storage_type          = "gp3"

  multi_az               = true
  db_subnet_group_name   = aws_db_subnet_group.private.name
  vpc_security_group_ids = [aws_security_group.rds.id]

  backup_retention_period = 7
  backup_window          = "03:00-04:00"
  maintenance_window     = "Mon:04:00-Mon:05:00"

  skip_final_snapshot = false
  final_snapshot_identifier = "${var.project}-${var.environment}-postgres-final"

  tags = local.common_tags
}

Notice it:

  • Used your existing naming convention (var.project, var.environment)
  • Referenced your existing subnet group and security group
  • Picked an appropriate engine version for the current provider
  • Added sensible defaults you didn't specify (maintenance window, final snapshot)

Why We Made the AI Free

The free tier includes:

  • 5 users
  • 10 environments
  • 1 Terraform project with 3 workspaces
  • Full AI co-pilot with BYOK

Why give away the AI? Because:

  1. AI is table stakes now - Charging for basic AI features feels wrong in 2025
  2. BYOK means no margin anyway - You're paying OpenAI directly
  3. The value is the complete platform - AI alone isn't useful; AI integrated with full Terraform lifecycle management is

Our paid tiers ($9.2k-$48k/year) are for teams that need more capacity, enterprise security (SSO/SAML), and included API credits.


Beyond AI: Complete Terraform Lifecycle Management

The AI co-pilot is just one part. Realm9 provides end-to-end Terraform lifecycle management:

Projects & Workspaces

  • Organize infrastructure into projects with multiple workspaces (dev, staging, prod)
  • GitOps integration with GitHub/GitLab for version control
  • Automatic plan/apply workflows with approval gates

Enterprise-Grade Security

  • End-to-end encryption for all credentials and secrets
  • Cloud provider credentials stored with AES-256 encryption
  • No plaintext secrets ever touch disk

Compliance & Audit Trail

  • SOC 2 Type II compliant controls
  • ISO 27001 security framework
  • Complete audit logging of every action
  • Who ran what, when, and what changed
  • Exportable audit reports for compliance reviews

State Management

  • Secure remote state storage
  • State locking to prevent conflicts
  • State versioning and rollback capabilities
  • Drift detection between state and actual infrastructure

This isn't just an AI wrapper—it's a complete Terraform platform that happens to have AI built in.


The Bigger Picture: Environment Management

The AI co-pilot is part of Realm9, a platform that also handles:

  • Environment booking - No more spreadsheets or Slack wars over who's using staging
  • Built-in observability - Logs/metrics/traces at 1/10th the cost of Datadog
  • Drift detection - Know when infrastructure doesn't match code

We built it because we were spending $150k+/year on Plutora + Terraform Cloud + Datadog, and they didn't even talk to each other.


Try It Yourself

Option 1: Self-host free tier

  • Installation guide - Deploy on your Kubernetes cluster in 30 minutes
  • Bring your own LLM API key
  • Full AI co-pilot included

Option 2: Evaluate enterprise features

  • 14-day evaluation - Test Terraform automation, SSO/SAML, advanced AI
  • No credit card required

Option 3: Explore the code


What's Next

We're working on:

  • Multi-cloud support - Same AI, different providers (Azure, GCP)
  • Cost estimation - "This change will add ~$45/month"
  • Policy as Code - AI suggests compliant configurations

Follow our GitHub or check realm9.app for updates.


Questions? Drop them in the comments. I'll answer everything about the architecture, AI approach, or why we made certain decisions.
