Jesus Oviedo Riquelme

Posted on Oct 17

LLPY-11: Terraform - Infrastructure as Code

#spanish #devops #googlecloud #terraform

🎯 The Challenge of Managing Cloud Infrastructure

Imagine you need to deploy your RAG system on GCP:

✅ VM for Qdrant (Compute Engine)
✅ API on Cloud Run (FastAPI)
✅ Batch Job (Cloud Run Job for processing)
✅ Storage (Google Cloud Storage)
✅ Secrets (Secret Manager for .env and JWT keys)

The problem: how do you create, update, and manage all of this infrastructure in a way that is reproducible, versioned, and collaborative?

Options for Managing Infrastructure

Method	Pros	Cons	Reproducibility
Console UI (manual)	Easy, visual	Error-prone, not versioned	❌ None
gcloud CLI scripts	Automated	Fragile scripts, hard to roll back	⚠️ Limited
CloudFormation	AWS-native IaC	AWS only	✅ High (AWS only)
Pulumi	Multiple languages	Requires a language runtime	✅ High
Terraform	Declarative, multi-cloud	Learning curve	✅✅ Very high

Our choice: Terraform

📊 The Scale of the Problem

Infrastructure Requirements

For the Lus Laboris project we need:

🗄️ Google Cloud Storage
- Bucket for processed data
- Bucket for Terraform state (remote)
🖥️ Compute Engine VM
- VM for Qdrant (vector database)
- Ubuntu 22.04 LTS
- Firewall rules (ports 6333, 6334, 22)
- SPOT instance (cost optimization)
🚀 Cloud Run Service
- FastAPI container
- Secrets mounted from Secret Manager
- Auto-scaling (0-10 instances)
- 2 CPU, 2Gi RAM
⏰ Cloud Run Job
- Batch processing (scheduled)
- Cron schedule
- Logs and notifications
🔐 Secret Manager
- .env file (all API config)
- JWT public key
- IAM permissions

Challenges Without IaC

Without Terraform:

Developer 1 creates the VM via the Console → settings are never documented
Developer 2 needs to replicate it → "Which settings did you use?"
Developer 1: "Hmm... I think it was e2-medium with 20GB..."
Developer 2: "And the firewall?"
Developer 1: "I forgot... you'll have to experiment"

Disaster Recovery: "The VM was deleted, how do I recreate it?"
Team: 🤷 "There is no backup of the configuration"

With Terraform:

git clone repo
terraform apply
✅ The whole infrastructure recreated in 5-10 minutes
✅ Exact configuration versioned in Git
✅ Living documentation in code
✅ Rollback to any previous version

💡 The Solution: Terraform Infrastructure as Code

What is Terraform?

Terraform (by HashiCorp) is an Infrastructure as Code (IaC) tool that lets you:

📝 Define infrastructure in declarative files (HCL)
📊 Plan changes before applying them
🚀 Apply changes idempotently
🗑️ Destroy infrastructure cleanly
🔄 Version it in Git like any other code

Terraform Principles

Declarative: you describe the desired state, not the steps to reach it
Idempotent: running it N times yields the same result
Plan-Apply: preview the changes before executing them
State Management: tracks the current state against the desired state
Modular: reusable components
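To make "declarative" concrete, here is a minimal sketch of what the gcs module could contain. The resource label bucket matches the address shown later in the apply log (module.gcs.google_storage_bucket.bucket); the remaining arguments are assumptions, not the project's actual code.

```hcl
# Hypothetical modules/gcs/main.tf: describes the bucket that should exist,
# not the API calls needed to create it.
variable "project_id" { type = string }
variable "region" { type = string }
variable "bucket_name" { type = string }

resource "google_storage_bucket" "bucket" {
  project  = var.project_id
  name     = var.bucket_name
  location = var.region

  # Assumed setting: IAM-only access control, no per-object ACLs.
  uniform_bucket_level_access = true
}

output "bucket_name" {
  value = google_storage_bucket.bucket.name
}
```

Running terraform apply twice against this file creates the bucket the first time and reports no changes the second time, which is the idempotency described above.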

🏗️ Terraform Architecture in the Project

Folder Structure

terraform/
├── main.tf                    # Module orchestration
├── variables.tf               # Variable definitions
├── terraform.tfvars           # Variable values
├── providers.tf               # GCP provider configuration
├── tf_menu.sh                 # Interactive script for common operations
├── README.md                  # Documentation
│
└── modules/                   # Reusable modules
    ├── gcs/                   # Google Cloud Storage
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    │
    ├── compute_engine/        # VM for Qdrant
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    │
    ├── cloud_run_service/     # FastAPI on Cloud Run
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    │
    ├── cloud_run_job/         # Batch processing
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    │
    └── secret_manager/        # Secrets management
        ├── main.tf
        ├── variables.tf
        └── outputs.tf

Design principle: one module = one reusable logical resource

🚀 Step-by-Step Implementation

1. Provider Configuration

File terraform/providers.tf:

# Terraform Configuration
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 7.1.1"
    }
  }

  backend "gcs" {
    bucket = "py-labor-law-rag-terraform-state"
    prefix = "terraform/state"
  }
}

# Google Cloud Platform provider
provider "google" {
  project = var.project_id
  region  = var.region
}

Features:

✅ Remote state: state is stored in GCS, not on a local disk
✅ Pinned version: ~> 7.1.1 allows only 7.1.x patch releases (>= 7.1.1, < 7.2.0)
✅ State locking: the GCS backend locks the state automatically during operations
✅ Team collaboration: multiple developers share the same state
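The pessimistic constraint operator is easy to misread, so here is the same terraform block with the range spelled out. This is an illustrative alternative to the block above, not an addition to it; the required_version line is an assumption based on the Terraform 1.9.0 version used in the CI workflow later in the post.

```hcl
terraform {
  # Assumption: mirrors the Terraform CLI version pinned in CI.
  required_version = ">= 1.9.0"

  required_providers {
    google = {
      source = "hashicorp/google"
      # Equivalent to "~> 7.1.1": patch releases only, never 7.2.0 or later.
      version = ">= 7.1.1, < 7.2.0"
    }
  }
}
```

Writing "~> 7.1" instead would also accept 7.2, 7.3, and so on up to (but not including) 8.0, which is a looser pin than the one the project uses.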

2. Variable Definitions

File terraform/variables.tf:

# Basic GCP Configuration
variable "project_id" {
  description = "GCP Project ID"
  type        = string
}

variable "region" {
  description = "GCP Region"
  type        = string
  default     = "us-central1"
}

variable "project_number" {
  description = "GCP Project Number (for service accounts)"
  type        = string
}

# Google Cloud Storage
variable "bucket_name" {
  description = "Name for the GCS bucket"
  type        = string
}

# Compute Engine (Qdrant VM)
variable "qdrant_vm_name" {
  description = "Name for the Qdrant VM"
  type        = string
  default     = "qdrant-vm"
}

variable "qdrant_vm_machine_type" {
  description = "Machine type for Qdrant VM"
  type        = string
  default     = "e2-medium"
}

variable "qdrant_vm_zone" {
  description = "Zone for Qdrant VM"
  type        = string
  default     = "us-central1-a"
}

variable "qdrant_vm_disk_size" {
  description = "Boot disk size in GB"
  type        = number
  default     = 20
}

# Cloud Run Service (API)
variable "api_service_name" {
  description = "Name of the Cloud Run service"
  type        = string
  default     = "lus-laboris-api"
}

variable "api_image" {
  description = "Docker image for the API"
  type        = string
}

variable "api_container_port" {
  description = "Container port for the API"
  type        = number
  default     = 8000
}

variable "api_cpu" {
  description = "Number of CPUs for the API"
  type        = string
  default     = "2"
}

variable "api_memory" {
  description = "Memory for the API (e.g., 2Gi)"
  type        = string
  default     = "2Gi"
}

variable "api_min_instance_count" {
  description = "Minimum number of instances"
  type        = number
  default     = 0  # Scale to zero for cost savings
}

variable "api_max_instance_count" {
  description = "Maximum number of instances"
  type        = number
  default     = 10
}

variable "api_timeout" {
  description = "Request timeout"
  type        = string
  default     = "300s"
}

# Secret Manager
variable "api_env_secret_id" {
  description = "Secret ID for .env file"
  type        = string
}

variable "jwt_public_key_secret_id" {
  description = "Secret ID for JWT public key"
  type        = string
}

# Cloud Run Job (Batch)
variable "job_name" {
  description = "Name of the Cloud Run Job"
  type        = string
}

variable "image" {
  description = "Docker image for the job"
  type        = string
}

variable "args" {
  description = "Arguments for the job"
  type        = list(string)
  default     = []
}

variable "schedule" {
  description = "Cron schedule for the job"
  type        = string
  default     = "0 2 * * *"  # 2 AM daily
}

variable "notify_email" {
  description = "Email for job notifications"
  type        = string
}
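Variables can also reject bad input before any resource is planned. The sketch below shows how the qdrant_vm_zone declaration could carry a validation block; it would replace the declaration above rather than sit next to it, and the regular expression is an illustrative assumption about what a GCP zone name looks like.

```hcl
variable "qdrant_vm_zone" {
  description = "Zone for Qdrant VM"
  type        = string
  default     = "us-central1-a"

  validation {
    # Accepts names shaped like "us-central1-a" or "europe-west4-b".
    condition     = can(regex("^[a-z]+-[a-z]+[0-9]+-[a-z]$", var.qdrant_vm_zone))
    error_message = "qdrant_vm_zone must look like a GCP zone, for example us-central1-a."
  }
}
```

With this in place, terraform plan -var="qdrant_vm_zone=us-central1" fails immediately with the error message instead of failing later inside the Compute Engine API.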

3. Module: Compute Engine (VM for Qdrant)

The compute_engine module creates the Qdrant VM with a cost-optimized configuration:

What does it create?

VM Instance (google_compute_instance.qdrant_vm)
- OS: Ubuntu 22.04 LTS
- Machine type: e2-medium (2 vCPU, 4GB RAM)
- Disk: 20GB (configurable)
- Provisioning: SPOT (preemptible) ← roughly 80% cheaper
- Network: default, with an ephemeral public IP
- Tags: qdrant-server, http-server, https-server
Firewall Rule (google_compute_firewall.qdrant_firewall)
- Allows TCP on ports: 6333 (HTTP), 6334 (gRPC), 22 (SSH)
- Source: 0.0.0.0/0 (⚠️ restrict this in production)
- Target: VMs tagged qdrant-server
Module outputs
- vm_external_ip: used by the API to connect
- vm_internal_ip: for internal communication inside GCP
- vm_name and vm_zone: for reference

Advantages of SPOT instances:

💰 Roughly 80% cheaper than a regular VM (the actual discount varies by machine type and region)
🔄 Preemptible: GCP can stop the VM at any time (acceptable for Qdrant when its data lives on a persistent disk)
✅ A good fit for workloads that tolerate interruptions and keep their state on persistent storage
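The module's source is not shown in this post, so the following is only a sketch of how modules/compute_engine/main.tf could express the list above. The resource labels and output names come from the text; the image family, the firewall name pattern, and the termination action are assumptions.

```hcl
variable "vm_name" { type = string }
variable "machine_type" { type = string }
variable "zone" { type = string }
variable "disk_size" { type = number }

resource "google_compute_instance" "qdrant_vm" {
  name         = var.vm_name
  machine_type = var.machine_type
  zone         = var.zone
  tags         = ["qdrant-server", "http-server", "https-server"]

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2204-lts" # assumed image family
      size  = var.disk_size
    }
  }

  network_interface {
    network = "default"
    access_config {} # empty block requests an ephemeral public IP
  }

  # Spot provisioning requires preemptible = true and automatic_restart = false.
  scheduling {
    provisioning_model          = "SPOT"
    preemptible                 = true
    automatic_restart           = false
    instance_termination_action = "STOP" # assumed: keep the disk when preempted
  }
}

resource "google_compute_firewall" "qdrant_firewall" {
  name    = "${var.vm_name}-firewall"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["6333", "6334", "22"]
  }

  source_ranges = ["0.0.0.0/0"] # restrict in production
  target_tags   = ["qdrant-server"]
}

output "vm_external_ip" {
  value = google_compute_instance.qdrant_vm.network_interface[0].access_config[0].nat_ip
}

output "vm_internal_ip" {
  value = google_compute_instance.qdrant_vm.network_interface[0].network_ip
}
```

Because the public IP is ephemeral, vm_external_ip can change after a preemption and restart; clients inside the same VPC are better served by vm_internal_ip.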

4. Module: Cloud Run Service (FastAPI)

The cloud_run_service module deploys the API on Cloud Run:

What does it create?

Cloud Run Service (google_cloud_run_v2_service.api_service)
- Container: Docker image from Docker Hub or GCR
- Port: 8000 (FastAPI)
- CPU: 2 cores
- Memory: 2Gi
- Scaling: 0-10 instances (scale-to-zero enabled)
- Timeout: 300s (5 minutes)
- Deletion protection: false (makes destroy easy)
IAM Policy (public access)
- Role: roles/run.invoker
- Members: allUsers (anyone can call the API)
Traffic Routing
- 100% of traffic to the latest revision
- Deployment strategy: Rolling update

Key features:

✅ Scale to zero: min instances = 0 → cost savings when there is no traffic
✅ Auto-scaling: scales automatically with the request load
✅ Serverless: no servers to manage
✅ Pay-per-use: you only pay while requests are being served

Note on secrets:
The secrets (.env and JWT key) are mounted by a GitHub Actions workflow, not directly by Terraform. The basic module creates the service, and the update-api-secrets-deploy.yml workflow updates the service with the secrets.
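As with the VM, the module body is not reproduced in the post. The sketch below is one plausible shape for modules/cloud_run_service/main.tf, using the variable names that main.tf passes in; the IAM resource label public_invoker is hypothetical, and the two secret-name variables are declared but unused because the secrets are mounted by the workflow.

```hcl
variable "project_id" { type = string }
variable "region" { type = string }
variable "service_name" { type = string }
variable "image" { type = string }
variable "container_port" { type = number }
variable "cpu" { type = string }
variable "memory" { type = string }
variable "min_instance_count" { type = number }
variable "max_instance_count" { type = number }
variable "timeout" { type = string }
variable "env_secret_name" { type = string } # consumed by the workflow, not here
variable "jwt_secret_name" { type = string } # consumed by the workflow, not here

resource "google_cloud_run_v2_service" "api_service" {
  project             = var.project_id
  name                = var.service_name
  location            = var.region
  deletion_protection = false

  template {
    timeout = var.timeout

    scaling {
      min_instance_count = var.min_instance_count
      max_instance_count = var.max_instance_count
    }

    containers {
      image = var.image

      ports {
        container_port = var.container_port
      }

      resources {
        limits = {
          cpu    = var.cpu
          memory = var.memory
        }
      }
    }
  }

  # Send all traffic to whichever revision is newest.
  traffic {
    type    = "TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"
    percent = 100
  }
}

# Public access: anyone may invoke the service.
resource "google_cloud_run_v2_service_iam_member" "public_invoker" {
  project  = var.project_id
  location = google_cloud_run_v2_service.api_service.location
  name     = google_cloud_run_v2_service.api_service.name
  role     = "roles/run.invoker"
  member   = "allUsers"
}

output "service_url" {
  value = google_cloud_run_v2_service.api_service.uri
}
```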

5. Module: Secret Manager

The secret_manager module creates secrets in GCP Secret Manager:

What does it create?

Secret for .env (google_secret_manager_secret.api_env_file)
- ID: configurable (typically api-env-file)
- Replication: automatic (multi-region)
- Content: added later via GitHub Actions
Secret for the JWT public key (google_secret_manager_secret.jwt_public_key)
- ID: configurable (typically jwt-public-key)
- Replication: automatic
- Content: public key added via GitHub Actions
IAM Permissions (for Cloud Run)
- Role: roles/secretmanager.secretAccessor
- Member: the Cloud Run service account
- Lets Cloud Run read the secrets

Workflow:

Terraform creates the secret placeholders (empty)
GitHub Actions adds the real secret content
Cloud Run mounts the secrets as files

Advantages:

✅ Separation: Terraform manages the structure, GitHub Actions manages the sensitive content
✅ Security: secrets are never versioned in code
✅ Rotation: secrets are easy to update without touching the infrastructure
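A minimal sketch of how the module could implement the placeholders and the IAM grant. The secret resource labels and output names come from the text and from main.tf; the IAM resource labels are hypothetical. Since main.tf passes the project number as cloud_run_service_account, this sketch assumes the service runs as the default compute service account, which the project may or may not do.

```hcl
variable "project_id" { type = string }
variable "cloud_run_service_account" { type = string } # receives the project number
variable "api_env_secret_id" { type = string }
variable "jwt_public_key_secret_id" { type = string }

locals {
  # Assumption: default compute service account, PROJECT_NUMBER-compute@...
  cloud_run_member = "serviceAccount:${var.cloud_run_service_account}-compute@developer.gserviceaccount.com"
}

# Empty placeholders: no secret version is created here.
resource "google_secret_manager_secret" "api_env_file" {
  project   = var.project_id
  secret_id = var.api_env_secret_id

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret" "jwt_public_key" {
  project   = var.project_id
  secret_id = var.jwt_public_key_secret_id

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_iam_member" "env_accessor" {
  project   = var.project_id
  secret_id = google_secret_manager_secret.api_env_file.secret_id
  role      = "roles/secretmanager.secretAccessor"
  member    = local.cloud_run_member
}

resource "google_secret_manager_secret_iam_member" "jwt_accessor" {
  project   = var.project_id
  secret_id = google_secret_manager_secret.jwt_public_key.secret_id
  role      = "roles/secretmanager.secretAccessor"
  member    = local.cloud_run_member
}

output "env_secret_name" {
  value = google_secret_manager_secret.api_env_file.secret_id
}

output "jwt_secret_name" {
  value = google_secret_manager_secret.jwt_public_key.secret_id
}
```

Keeping secret versions out of Terraform also keeps the secret values out of the state file, which is the main reason to let the workflow upload the content.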

6. Main File (Orchestration)

File terraform/main.tf:

# Google Cloud Storage
module "gcs" {
  source      = "./modules/gcs"
  project_id  = var.project_id
  region      = var.region
  bucket_name = var.bucket_name
}

# Compute Engine VM (Qdrant)
module "compute_engine" {
  source = "./modules/compute_engine"

  vm_name      = var.qdrant_vm_name
  machine_type = var.qdrant_vm_machine_type
  zone         = var.qdrant_vm_zone
  disk_size    = var.qdrant_vm_disk_size
}

# Secret Manager (MUST be created BEFORE Cloud Run)
module "secret_manager" {
  source = "./modules/secret_manager"

  project_id                = var.project_id
  cloud_run_service_account = var.project_number
  api_env_secret_id         = var.api_env_secret_id
  jwt_public_key_secret_id  = var.jwt_public_key_secret_id
}

# Cloud Run Service (API)
module "cloud_run_service" {
  source = "./modules/cloud_run_service"

  project_id         = var.project_id
  region             = var.region
  service_name       = var.api_service_name
  image              = var.api_image
  container_port     = var.api_container_port
  cpu                = var.api_cpu
  memory             = var.api_memory
  min_instance_count = var.api_min_instance_count
  max_instance_count = var.api_max_instance_count
  timeout            = var.api_timeout
  env_secret_name    = module.secret_manager.env_secret_name
  jwt_secret_name    = module.secret_manager.jwt_secret_name

  depends_on = [module.secret_manager]
}

# Cloud Run Job (Batch Processing)
module "cloud_run_job" {
  source       = "./modules/cloud_run_job"
  project_id   = var.project_id
  region       = var.region
  job_name     = var.job_name
  image        = var.image
  args         = var.args
  schedule     = var.schedule
  notify_email = var.notify_email
}

# Outputs
output "qdrant_vm_external_ip" {
  description = "External IP for Qdrant VM"
  value       = module.compute_engine.vm_external_ip
}

output "api_service_url" {
  description = "URL of the deployed API"
  value       = module.cloud_run_service.service_url
}

output "bucket_name" {
  description = "Name of the GCS bucket"
  value       = module.gcs.bucket_name
}
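The cloud_run_job module is the only one called above that the post does not describe further. The sketch below is a guess at its core: a Cloud Run job plus a Cloud Scheduler trigger for the cron expression. The resource labels batch and trigger, the use of the default compute service account for the trigger, and the omission of any alerting for notify_email are all assumptions.

```hcl
variable "project_id" { type = string }
variable "region" { type = string }
variable "job_name" { type = string }
variable "image" { type = string }
variable "args" { type = list(string) }
variable "schedule" { type = string }
variable "notify_email" { type = string } # alerting resources not shown in this sketch

data "google_project" "current" {
  project_id = var.project_id
}

resource "google_cloud_run_v2_job" "batch" {
  project             = var.project_id
  name                = var.job_name
  location            = var.region
  deletion_protection = false

  template {
    template {
      containers {
        image = var.image
        args  = var.args
      }
    }
  }
}

# Cloud Scheduler calls the job's :run endpoint on the cron schedule.
resource "google_cloud_scheduler_job" "trigger" {
  project  = var.project_id
  region   = var.region
  name     = "${var.job_name}-schedule"
  schedule = var.schedule

  http_target {
    http_method = "POST"
    uri         = "https://${var.region}-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/${var.project_id}/jobs/${google_cloud_run_v2_job.batch.name}:run"

    oauth_token {
      # Assumed identity; it needs permission to run the job.
      service_account_email = "${data.google_project.current.number}-compute@developer.gserviceaccount.com"
    }
  }
}
```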

🔄 Usage Workflow

1. Initialization

# Change to the terraform directory
cd terraform

# Configure credentials
export GOOGLE_APPLICATION_CREDENTIALS="../.gcpcredentials/service-account.json"

# Initialize Terraform (download providers, configure the backend)
terraform init

Output:

Initializing the backend...
Initializing modules...
Initializing provider plugins...
- Finding hashicorp/google versions matching "~> 7.1.1"...
- Installing hashicorp/google v7.1.1...

Terraform has been successfully initialized!

2. Plan (Preview Changes)

# See which changes will be applied
terraform plan

Output:

Terraform will perform the following actions:

  # module.compute_engine.google_compute_firewall.qdrant_firewall will be created
  + resource "google_compute_firewall" "qdrant_firewall" {
      + name    = "qdrant-vm-firewall"
      + network = "default"
      + ports   = ["6333", "6334", "22"]
    }

  # module.compute_engine.google_compute_instance.qdrant_vm will be created
  + resource "google_compute_instance" "qdrant_vm" {
      + name         = "qdrant-vm"
      + machine_type = "e2-medium"
      + zone         = "us-central1-a"
    }

  # module.cloud_run_service.google_cloud_run_v2_service.api_service will be created
  + resource "google_cloud_run_v2_service" "api_service" {
      + name     = "lus-laboris-api"
      + location = "us-central1"
    }

Plan: 15 to add, 0 to change, 0 to destroy.

3. Apply (Execute Changes)

# Apply the changes
terraform apply

# Or without confirmation (CI/CD):
terraform apply -auto-approve

Output:

module.secret_manager.google_secret_manager_secret.api_env_file: Creating...
module.secret_manager.google_secret_manager_secret.jwt_public_key: Creating...
module.compute_engine.google_compute_firewall.qdrant_firewall: Creating...
module.gcs.google_storage_bucket.bucket: Creating...
module.compute_engine.google_compute_instance.qdrant_vm: Creating...

...

Apply complete! Resources: 15 added, 0 changed, 0 destroyed.

Outputs:

api_service_url = "https://lus-laboris-api-abc123-uc.a.run.app"
bucket_name = "my-data-bucket"
qdrant_vm_external_ip = "34.123.45.67"

4. Verification

# Show all outputs
terraform output

# Show a specific output
terraform output api_service_url

# API health check
curl "$(terraform output -raw api_service_url)/api/health"

5. Destroy (Remove Infrastructure)

# Remove ALL of the infrastructure
terraform destroy

# With auto-approve (careful!)
terraform destroy -auto-approve
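Since terraform destroy removes everything it manages, resources that hold data are worth guarding. A minimal sketch, assuming the data bucket is declared as in the gcs module address seen in the apply log; whether the project actually sets this flag is not stated in the post.

```hcl
variable "bucket_name" { type = string }
variable "region" { type = string }

resource "google_storage_bucket" "bucket" {
  name     = var.bucket_name
  location = var.region

  lifecycle {
    # Any plan that would delete this bucket fails with an error,
    # including terraform destroy -auto-approve.
    prevent_destroy = true
  }
}
```

To really delete the bucket later, the flag has to be removed in code first, which turns an accidental deletion into a reviewed change.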

🎯 Interactive Script: tf_menu.sh

The project includes a bash script that automates common Terraform operations:

What does tf_menu.sh do?

Sets GCP credentials automatically
- Looks for a JSON file in .gcpcredentials/
- Exports GOOGLE_APPLICATION_CREDENTIALS
Generates terraform.tfvars from .env
- Reads variables from the project's .env
- Creates terraform.tfvars automatically
- Validates variable formats (VM name, zone, etc.)
Validates the environment before running
- Checks that credentials are set
- Checks that terraform.tfvars exists
Offers an interactive menu

   ========= Terraform Menu =========
   1) Setear GOOGLE_APPLICATION_CREDENTIALS
   2) Crear archivo terraform.tfvars
   3) terraform init
   4) terraform plan
   5) terraform apply
   6) terraform destroy
   7) Salir

Using the script:

cd terraform
chmod +x tf_menu.sh

# Interactive mode
./tf_menu.sh

# Non-interactive mode (for scripting)
./tf_menu.sh 1   # Set credentials
./tf_menu.sh 2   # Create tfvars
./tf_menu.sh 3   # Init
./tf_menu.sh 4   # Plan
./tf_menu.sh 5   # Apply

Advantages of the script:

✅ Automates setup: creates tfvars from .env
✅ Validations: prevents common mistakes
✅ DRY: no duplicated variables between .env and tfvars
✅ Developer-friendly: menu in Spanish, clear messages
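For orientation, this is roughly what a generated terraform.tfvars could look like. It sets the variables from variables.tf that have no default; every value here is a placeholder invented for illustration, apart from the secret IDs and bucket name, which reuse the examples that appear elsewhere in this post.

```hcl
# terraform/terraform.tfvars (illustrative values only)
project_id               = "my-gcp-project"
project_number           = "123456789012"
region                   = "us-central1"
bucket_name              = "my-data-bucket"
api_image                = "docker.io/example/lus-laboris-api:latest"
api_env_secret_id        = "api-env-file"
jwt_public_key_secret_id = "jwt-public-key"
job_name                 = "lus-laboris-batch"
image                    = "docker.io/example/lus-laboris-processing:latest"
notify_email             = "ops@example.com"
```

Terraform loads a file named terraform.tfvars automatically, so no -var-file flag is needed for it.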

🔄 CI/CD Integration (GitHub Actions)

File .github/workflows/terraform-apply-on-tf-change.yml:

name: Terraform Apply on TF Change

on:
  push:
    paths:
      - 'terraform/**/*.tf'
      - 'terraform/**/*.tfvars'

jobs:
  terraform:
    runs-on: ubuntu-latest
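    env:
      # GOOGLE_CREDENTIALS accepts the key's JSON contents (assuming GSA_KEY stores the JSON);
      # GOOGLE_APPLICATION_CREDENTIALS would expect a file path instead
      GOOGLE_CREDENTIALS: ${{ secrets.GSA_KEY }}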

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v3
      with:
        terraform_version: 1.9.0

    - name: Terraform Init
      working-directory: ./terraform
      run: terraform init

    - name: Terraform Validate
      working-directory: ./terraform
      run: terraform validate

    - name: Terraform Plan
      working-directory: ./terraform
      run: terraform plan -out=tfplan

    - name: Terraform Apply
      working-directory: ./terraform
      run: terraform apply -auto-approve tfplan

    - name: Show Outputs
      working-directory: ./terraform
      run: terraform output

Advantages:

✅ Auto-apply: changes to .tf files are deployed automatically
✅ Validation: validate runs before apply
✅ Saved plan: apply executes exactly the plan that was generated
✅ Visible outputs: results appear in the job logs

🎯 Real-World Use Cases

Replicating an Environment:

"I need to create a staging environment identical to production"

Solution:

# Create a workspace for staging
terraform workspace new staging

# Apply with the staging variables
terraform apply -var-file="staging.tfvars"

# Result: identical infrastructure in minutes
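One caveat worth sketching: a workspace gives each environment its own state, but resource names still have to differ when both environments live in the same GCP project. A hedged way to handle that is to derive a suffix from terraform.workspace; the local names below are hypothetical, and the module call is the compute_engine call from main.tf with only vm_name changed.

```hcl
locals {
  # "default" keeps the original names; any other workspace gets a suffix.
  name_suffix = terraform.workspace == "default" ? "" : "-${terraform.workspace}"
}

module "compute_engine" {
  source = "./modules/compute_engine"

  vm_name      = "${var.qdrant_vm_name}${local.name_suffix}" # e.g. qdrant-vm-staging
  machine_type = var.qdrant_vm_machine_type
  zone         = var.qdrant_vm_zone
  disk_size    = var.qdrant_vm_disk_size
}
```

The same suffix would need to be applied to the Cloud Run service, job, bucket, and secret names, or staging.tfvars can simply set distinct names for each of them.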

Disaster Recovery:

"The Qdrant VM was deleted by accident"

Solution:

# Inspect the current state
terraform state list

# Recreate only the VM
terraform apply -target=module.compute_engine

# Result: VM recreated with the exact same configuration

Updating Resources:

"I need to increase the API's RAM from 2Gi to 4Gi"

Solution:

# terraform/terraform.tfvars
api_memory = "4Gi"  # Was "2Gi"

terraform apply

# Result: Cloud Run updated with zero downtime

Multiple Environments:

"I want dev, staging, and prod with different specs"

Solution:

# Create the workspaces
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod

# Use per-workspace variables
terraform workspace select dev
terraform apply -var="api_cpu=1" -var="api_memory=1Gi"

terraform workspace select prod
terraform apply -var="api_cpu=4" -var="api_memory=8Gi"
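Passing -var flags by hand is easy to get wrong, so an alternative is to keep the per-environment sizes in code and select them by workspace. This is a sketch under the assumption that workspace names match the map keys; the local and output names are hypothetical, and the fallback reuses the defaults from variables.tf.

```hcl
locals {
  api_sizes = {
    dev  = { cpu = "1", memory = "1Gi" }
    prod = { cpu = "4", memory = "8Gi" }
  }

  # Workspaces without an entry (for example staging) fall back to the defaults.
  api_size = lookup(local.api_sizes, terraform.workspace, { cpu = "2", memory = "2Gi" })
}

# Shows which size the current workspace resolves to.
output "effective_api_size" {
  value = local.api_size
}
```

In main.tf the cloud_run_service module would then receive cpu = local.api_size.cpu and memory = local.api_size.memory instead of var.api_cpu and var.api_memory.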

🚀 The Transformative Impact

Before Terraform:

🖱️ Manual deployment: 30-60 minutes of clicking through the console
📝 Documentation: scattered, outdated notes
🐛 Errors: "I forgot to open port 6334"
🔄 Replication: practically impossible to create an identical environment
💥 Disaster recovery: days to recreate the infrastructure

After Terraform:

⚡ Automated deployment: terraform apply = 5-10 minutes
📝 Living documentation: the code IS the documentation
✅ Fewer errors: configuration is reviewed and versioned
🔄 Repeatable replication: terraform apply in any environment
💥 Disaster recovery: minutes to recreate everything

Improvement Metrics (project estimates, not benchmarks):

Aspect	Without IaC	With Terraform	Improvement
Setup time	30-60 min	5-10 min	~-80%
Config errors	Frequent	Rare	~-90%
Reproducibility	Practically impossible	Repeatable	N/A
Versioning	No	Yes (Git)	N/A
Rollback	Manual	`git revert` + apply	N/A
Team collaboration	Chaos	Synchronized	N/A

💡 Lessons Learned

1. Remote State Is Mandatory

Never use local state for real projects. The GCS backend enables collaboration and prevents conflicting writes.

2. Modules = Reuse

A well-designed module can be used across multiple projects with minimal adjustments.

3. Variables with Sensible Defaults

Good defaults = less required configuration = fewer errors.

4. Outputs Are Documentation

Outputs surface the critical information (IPs, URLs) you need after the deploy.

5. depends_on for Ordering

Some resources must be created in order (Secret Manager → Cloud Run). Terraform already infers the order when one block references another block's outputs, as the Cloud Run module does with module.secret_manager; an explicit depends_on additionally covers dependencies that are not referenced, such as the IAM grants inside that module.
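The difference between implicit and explicit ordering can be tried without touching GCP. The sketch below uses the built-in terraform_data resource (Terraform 1.4 or later) as a stand-in for real resources; the three resource names and their input strings are purely illustrative.

```hcl
# Stand-in for the secret.
resource "terraform_data" "secret" {
  input = "api-env-file"
}

# Implicit dependency: the reference to terraform_data.secret.output
# tells Terraform to create the secret first.
resource "terraform_data" "grant" {
  input = "accessor on ${terraform_data.secret.output}"
}

# Explicit dependency: nothing here references the grant,
# so depends_on is the only way to express the ordering.
resource "terraform_data" "service" {
  input      = "lus-laboris-api"
  depends_on = [terraform_data.grant]
}
```

Running terraform apply in an empty directory containing only this file creates the three resources in the order secret, grant, service; removing depends_on lets Terraform create service in parallel with the others.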

6. SPOT Instances = ~80% Savings

For interruption-tolerant workloads (such as Qdrant with persistent storage), SPOT instances are a large and easy cost saving.

🎯 The Bigger Purpose

Terraform is not just "automation"; it is infrastructure as a product. By treating infrastructure as code we get:

📝 Versioning: every change lives in Git with diff, blame, and history
👥 Collaboration: pull requests for infrastructure changes
🔍 Review: code review of infrastructure before it is applied
🔄 Rollback: git revert + terraform apply = fast rollback
📚 Documentation: the code does not lie, and it is always up to date
🧪 Testing: ephemeral environments for testing
🚀 Velocity: from idea to production in minutes, not hours

We are building infrastructure with the same quality, rigor, and speed as application code.

🔗 Resources and Links

Next Post: LLPY-12 - Docker and Containerization

In the next post we will explore how to containerize the application with Docker: multi-stage builds, image optimization, Docker Compose for local development, and publishing to Docker Hub.
