Each chapter in this series has been a deliberate step toward removing friction from LLM infrastructure management. Chapter 2 introduced Infrastructure as Code by mapping Kubernetes resources directly to Terraform — functional, but painfully verbose. Chapter 3 resolved the abstraction problem by bringing in the Helm provider, collapsing 500 lines of HCL into 50 and allowing Terraform to reason about applications rather than individual resources. Both approaches, however, shared the same fundamental constraint: every change still required a human to run terraform apply. The cluster had no awareness of Git, drift went undetected until someone noticed, and scaling that model to larger teams or more frequent deploys would inevitably make manual execution a bottleneck. Chapter 4 closes that loop by introducing GitOps with ArgoCD — making the cluster itself responsible for continuously reconciling its state against Git, without anyone needing to trigger a command.
The Four Principles of GitOps
GitOps is an operational paradigm built on four interconnected principles:
1. Declarative
Everything the cluster needs to run is expressed as YAML files. There is no procedural "run this command" — only a description of the desired end state.
2. Versioned
All state lives in Git. Every change has an author, a timestamp, and a diff. Rollback becomes as simple as reverting a commit.
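The revert-to-rollback path can be sketched end to end in a throwaway repository. Everything here is illustrative (paths, values, and commit messages are not from the chapter's actual repo):

```shell
# Rollback sketch: revert a bad config commit; ArgoCD would converge on the
# reverted state at its next poll. Paths and values are illustrative.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email demo@example.com && git config user.name demo

mkdir -p apps/ollama
echo "gpu: true" > apps/ollama/values.yaml
git add -A && git commit -qm "good config"

echo "gpu: false" > apps/ollama/values.yaml
git add -A && git commit -qm "bad config"

git revert --no-edit HEAD      # undo the bad commit with a new commit
cat apps/ollama/values.yaml    # back to: gpu: true
```

The key point: the rollback itself is just another commit, so it carries the same author, timestamp, and diff as any other change.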
3. Pull-based (key differentiator)
Chapter 3: Developer → terraform apply → push → Cluster
Chapter 4: Developer → git push → Git ← poll ← ArgoCD (in the cluster)
4. Continuous reconciliation
ArgoCD runs an infinite loop:
while True:
    git_state = fetch_from_git()
    cluster_state = fetch_from_kubernetes()
    if git_state != cluster_state:
        apply_changes()
    sleep(180)  # 3 minutes
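A single iteration of this loop can be mimicked in plain shell with mocked state fetchers. The function bodies below are illustrative stand-ins, not ArgoCD internals:

```shell
# One reconcile iteration with mocked state sources (illustrative only)
fetch_from_git()        { echo "sha-abc123"; }   # desired state (Git HEAD)
fetch_from_kubernetes() { echo "sha-000000"; }   # live state (cluster)
apply_changes()         { echo "syncing cluster to Git state"; }

git_state=$(fetch_from_git)
cluster_state=$(fetch_from_kubernetes)
if [ "$git_state" != "$cluster_state" ]; then
  apply_changes   # runs only when desired and live state diverge
fi
```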
Context and Rationale
With the mechanics of GitOps established, it is worth stepping back to understand why this shift matters beyond a single team running git push. Removing terraform apply from the workflow is only the surface benefit — the deeper payoff is that GitOps creates the operational foundation for treating infrastructure as a product, with real ownership, roadmaps, and SLAs.
Growing Cloud Complexity
Cloud infrastructure promised simplicity. In 2010, AWS offered ~20 intuitive services, provisioned with clicks in the UI. The value proposition was clear: simplify infrastructure.
Today, reality is different:
AWS (2010):
├── ~20 services
├── Provisioning via UI
└── Simplicity as core value
AWS (2025):
├── 175+ services
├── Provisioning via IaC
└── Complexity as new reality
Organizational impact:
According to Max Griffiths (Thoughtworks):
"Rising cloud complexity is putting many organizations and their infrastructure teams right back to where they were 15 years ago — struggling to keep up with demand for new services and instances, and stay on top of an increasingly unmanageable infrastructure footprint."
Consequences:
- Time to market increased (instead of decreasing)
- Required skills grew exponentially
- Self-service became impractical
- Full-stack developer scope expanded to the point of becoming a disadvantage
Infrastructure as Product
Infrastructure as Product treats infrastructure not as a centralized service provider, but as a portfolio of internal products that enable product teams to deliver value quickly.
Paradigm shift:
Traditional (project-based):
Developer → Ticket → Ops Team → Wait → Provision → Deploy
Timeline: Days/weeks
Bottleneck: Ops team on critical path
Infrastructure as Product:
Developer → Self-service Platform → Provision → Deploy
Timeline: Minutes/hours
Enabler: Platform removes friction
Three Core Principles
1. Developer as Customer
Infrastructure should be designed around developer experience first. If the platform is hard to use, it won't be adopted — regardless of how technically sound it is.
Success metrics:
- Developer satisfaction score
- Time to provision (target: <30 min)
- Platform adoption rate
- Support tickets reduction
Anti-pattern:
# Requires learning a new language just to provision
developer:
  must_learn: [HCL, Pulumi, CloudFormation]
  to_do: "Same work as before"
  result: "Poor developer experience"
Correct pattern:
# Familiar interface, correct abstraction
developer:
  uses: [YAML, familiar APIs]
  self_service: true
  result: "High adoption, low friction"
2. Platform Teams With a Product Mindset
According to Sebastian Straube (Accenture), infrastructure teams shouldn't operate as a shared service that reacts to tickets — they should be restructured into dedicated platform product teams, each owning a slice of the internal platform with a roadmap, SLAs, and real accountability to their users.
Organization example:
Platform Team: Compute & Container
├── Product: Kubernetes-as-a-Service
├── Customers: All development teams
├── Roadmap:
│ ├── Q1: Auto-scaling GPU nodes
│ ├── Q2: Service mesh integration
│ └── Q3: Cost optimization
├── SLA: 99.9% uptime, <5min provision
└── Metrics: Adoption rate, developer satisfaction
Platform Team: Observability
├── Product: Unified logs/metrics/traces
├── Customers: All teams (dev + ops)
├── Roadmap:
│ ├── Q1: 30-day retention
│ ├── Q2: AIOps integration
│ └── Q3: Cost attribution
├── SLA: <5s query response
└── Metrics: Query volume, alert accuracy
Characteristics:
- Long-lived teams (not project-based)
- Own product roadmap and backlog
- Accountable for adoption and satisfaction
- Measure success by customer outcomes
3. Self-Service as a Core Value
The platform must enable true self-service — meaning the platform team is never on the critical path of a deployment.
Anti-pattern:
Developer → Submit PR → Platform review → Approve → Deploy
Problem: Platform team on critical path (bottleneck)
Correct pattern:
Developer → Use platform API → Automated validation → Deploy
Platform: Monitor outcomes
Real-world example:
# Developer workflow
kubectl apply -f app.yaml
# Platform validates automatically:
# ✓ Security policies (OPA)
# ✓ Resource quotas
# ✓ Network policies
# ✓ Deploy without manual approval
GitOps as the Technical Foundation
GitOps is not just "automation" — it is the enabling layer that makes Infrastructure as Product operationally viable. Each of its four properties maps directly to a platform capability:
1. Declarative = Product Catalog
The Git repository becomes a catalog of available products that any team can consume, customize, and deploy independently.
k8s-apps/
├── apps/ollama/ # Product: LLM Inference
├── apps/librechat/ # Product: Chat Interface
└── apps/postgres/ # Product: Database
Developer "consumes" a product via:
git clone k8s-apps
cp -r apps/ollama apps/my-llm
vim apps/my-llm/values.yaml # Customize
git commit && git push
# ArgoCD auto-deploy
2. Self-Service = Developer Autonomy
ArgoCD removes the platform team from the critical path:
Without ArgoCD:
Developer → Code → Request deploy → Platform team → Manual deploy
With ArgoCD:
Developer → Code → Git push → ArgoCD auto-sync → Deployed
3. API-Driven = Programmatic Access
ArgoCD Application CRDs are the deployment API:
# Developer creates workload via Kubernetes API
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: my-service
spec:
source:
repoURL: https://github.com/company/k8s-apps
path: apps/my-service
destination:
server: https://kubernetes.default.svc
namespace: my-namespace
syncPolicy:
automated:
prune: true
selfHeal: true
4. Standardization With Flexibility
Helm charts serve as reusable product templates — the platform defines the structure, teams supply the configuration.
The platform provides templates:
├── web-app/ # Template for web apps
├── ml-service/ # Template for ML workloads
└── data-pipeline/ # Template for ETL
The developer customizes:
cp -r templates/web-app apps/my-app
vim apps/my-app/values.yaml # Configuration only
git push # Automatic deployment
Architectural Decision
App of Apps with Independent Wrappers
Rather than managing all services as a single unit, this architecture treats each service as an independent package within the App of Apps pattern. GitOps manages deployments, versions, and updates at the individual service level — not the stack level.
Each service (Ollama, LibreChat) has its own Helm chart wrapper that:
- References the upstream chart via dependencies in Chart.yaml
- Customizes configuration via a local values.yaml
- Maintains independent versioning
- Allows isolated evolution
Wrapper Structure
apps/
├── librechat/
│ ├── Chart.yaml # Wrapper chart with dependency
│ └── values.yaml # Specific configuration
└── ollama/
├── Chart.yaml # Wrapper chart with dependency
└── values.yaml # Specific configuration
Chart.yaml defines the dependency:
dependencies:
  - name: ollama
    version: "1.42.0" # Fixed version
    repository: "https://otwld.github.io/ollama-helm/"
values.yaml customizes the upstream chart:
ollama: # Wrapper namespace
  ollama: # Chart namespace (double hierarchy)
    gpu:
      enabled: true
    models:
      pull:
        - llama3.2:3b
Benefits of This Approach
1. Service Independence
- Each app evolves at its own pace
- Deploying Ollama does not affect LibreChat
- Reduces the risk of cross-service regressions
2. Granular Versioning
apps/ollama/Chart.yaml: version: "1.42.0"
apps/librechat/Chart.yaml: version: "1.9.7"
- Upstream chart versions pinned individually
- Upgrades tested and applied per service
- Independent rollback per application
3. Isolated Customization
- Values specific per service
- No configuration conflicts
- Individual testability
4. Per-Service Observability
- ArgoCD Application per service
- Isolated logs and events
- Specific health checking
5. Automated Deployment
Git commit → ArgoCD detects → Helm processes wrapper → Deploy to cluster
- ArgoCD manages each wrapper as an Application
- Automatic sync per service
- Independent self-healing
6. Complete Tracking
git log apps/ollama/values.yaml
# Complete change history for Ollama
git log apps/librechat/values.yaml
# Complete change history for LibreChat
- Audit trail per service
- Specific PRs per change
- Granular rollback via Git
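That granular rollback can be sketched with git checkout on a single path: restore only Ollama's values from the previous commit while LibreChat keeps its latest config. The repository contents below are illustrative:

```shell
# Per-service rollback sketch (illustrative repo and values)
set -e
cd "$(mktemp -d)" && git init -q .
git config user.email demo@example.com && git config user.name demo
mkdir -p apps/ollama apps/librechat
echo "model: llama3.2:3b" > apps/ollama/values.yaml
echo "title: v1"          > apps/librechat/values.yaml
git add -A && git commit -qm "v1"
echo "model: deepseek-r1:14b" > apps/ollama/values.yaml
echo "title: v2"              > apps/librechat/values.yaml
git add -A && git commit -qm "v2"

# Restore a single path from the previous revision; only the ollama
# Application sees a diff on the next ArgoCD sync
git checkout HEAD~1 -- apps/ollama/values.yaml
git commit -qm "rollback ollama only"
cat apps/ollama/values.yaml      # model: llama3.2:3b
cat apps/librechat/values.yaml   # title: v2
```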
Types of Wrappers
Wrapper with Public Chart (HTTP):
# apps/ollama/Chart.yaml
dependencies:
  - name: ollama
    version: "1.42.0"
    repository: "https://otwld.github.io/ollama-helm/"
Wrapper with OCI Chart:
# apps/librechat/Chart.yaml
dependencies:
  - name: librechat
    version: "1.9.7"
    repository: "oci://ghcr.io/danny-avila/librechat-chart"
Wrapper with Sub-charts:
# apps/custom-app/Chart.yaml
dependencies:
  - name: app
    version: "1.0.0"
    repository: "https://..."
  - name: postgres
    version: "12.0.0"
    repository: "https://charts.bitnami.com/bitnami"
  - name: redis
    version: "17.0.0"
    repository: "https://charts.bitnami.com/bitnami"
Governance and Standardization
No tight coupling:
- Each service maintains flexibility
- Standards enforced via linting (CI/CD)
- Common templates available, not mandatory
Example of suggested standard:
# All services follow the label convention
metadata:
  labels:
    app.kubernetes.io/name: {{ .Chart.Name }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/version: {{ .Chart.AppVersion }}
    app.kubernetes.io/managed-by: argocd
But each service can add specific labels without breaking others.
Alternatives Considered and Rejected
❌ Single monorepo without wrappers:
# values.yaml (monolithic)
ollama:
  gpu: ...
librechat:
  config: ...
postgres:
  ...
Problems:
- A change in Ollama affects LibreChat's versioning
- Atomic all-or-nothing deployment
- Rollback impacts all services
❌ Single custom chart:
# custom-chart/
templates/
  ollama.yaml
  librechat.yaml
  postgres.yaml
Problems:
- Reinventing the wheel (official charts already exist)
- Heavy maintenance burden
- Complicates upstream upgrades
✅ Independent wrappers (chosen):
- Reuses upstream charts
- Independence between services
- Easy maintenance
- Flexible governance
Trade-offs
Advantages:
- Full independence between services
- Granular versioning
- Individual rollback
- Isolated observability
- Scalability (adding an app = new directory)
Disadvantages:
- Duplication of common configurations (e.g., ingress annotations)
- Requires linting to enforce standards
- More files to manage
Considerations
Use when:
- Medium/large team (5+ people)
- Multiple independent services
- Frequent deploys per service
- Need for granular rollback
- Distributed governance (teams own services)
Do not use when:
- Monolithic application (single service)
- Very small team (1–2 people)
- All services always deployed together
- Preference for extreme simplicity
Implementing Terraform
This section implements main.tf. It does everything Terraform is responsible for in this architecture — which is intentionally narrow. Rather than managing applications directly, main.tf bootstraps the platform once and then hands control to ArgoCD. Concretely, it does four things in sequence: creates the three namespaces, provisions the credentials secret that LibreChat needs at runtime, installs ArgoCD via Helm, and registers the two ArgoCD Application CRDs that point at the Git repository. After that first terraform apply, Terraform is largely out of the picture — all subsequent application changes flow through Git.
The file is organised into four logical blocks, each separated by a comment header:
main.tf
├── Providers — Kubernetes + Helm provider config
├── Infrastructure — Namespaces + credentials secret
├── ArgoCD — Helm installation + values reference
└── ArgoCD Applications — CRDs that register Ollama and LibreChat
The subsections below walk through each block in order.
Providers
(main.tf lines 1–27 — terraform {} block + provider declarations)
terraform {
  required_version = ">= 1.0"

  required_providers {
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.11"
    }
  }
}

provider "kubernetes" {
  config_path    = "~/.kube/config"
  config_context = "minikube"
}

provider "helm" {
  kubernetes {
    config_path    = "~/.kube/config"
    config_context = "minikube"
  }
}
The provider configuration is identical to Chapter 3: both the kubernetes and helm providers read from ~/.kube/config and target the minikube context. No changes needed here.
Namespaces
(main.tf — INFRASTRUCTURE block)
resource "kubernetes_namespace" "argocd" {
  metadata {
    name = "argocd"
    labels = {
      managed-by = "terraform"
      purpose    = "gitops"
    }
  }
}

resource "kubernetes_namespace" "ollama" {
  metadata {
    name = "ollama"
    labels = {
      managed-by = "terraform"
    }
  }
}

resource "kubernetes_namespace" "librechat" {
  metadata {
    name = "librechat"
    labels = {
      managed-by = "terraform"
    }
  }
}
Design decision: Namespaces are managed by Terraform rather than ArgoCD because they are prerequisites for everything else — ArgoCD itself cannot deploy into a namespace that does not yet exist. They also change rarely enough that the overhead of GitOps reconciliation is not justified. The purpose = "gitops" label distinguishes tooling namespaces from application namespaces at a glance.
Infrastructure Secrets
(main.tf — still inside the INFRASTRUCTURE block, immediately after the namespaces)
resource "kubernetes_secret" "librechat_credentials" {
  metadata {
    name      = "librechat-credentials-env"
    namespace = kubernetes_namespace.librechat.metadata[0].name
  }

  data = {
    JWT_SECRET         = var.jwt_secret
    JWT_REFRESH_SECRET = var.jwt_refresh_secret
    CREDS_KEY          = var.creds_key
    CREDS_IV           = var.creds_iv
    MONGO_URI          = "mongodb://librechat-mongodb:27017/LibreChat"
    MEILI_HOST         = "http://librechat-meilisearch:7700"
    OLLAMA_BASE_URL    = "http://ollama.ollama.svc.cluster.local:11434"
  }

  type = "Opaque"
}
Design decision: Credentials are managed by Terraform precisely because they must not appear in Git. Committing plaintext secrets to a repository — even a private one — violates the principle of least exposure. For production environments, the right path is to replace this with Sealed Secrets, External Secrets Operator, or HashiCorp Vault, all of which allow secrets to live in Git in an encrypted or referenced form. Chapter 5 covers this.
ArgoCD Installation
(main.tf — ARGOCD block)
resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  namespace  = kubernetes_namespace.argocd.metadata[0].name
  version    = "5.51.6"

  values = [
    file("${path.module}/values/argocd-values.yaml")
  ]

  timeout       = 600
  wait          = true
  wait_for_jobs = true

  depends_on = [
    kubernetes_namespace.argocd
  ]
}
ArgoCD Applications (CRDs)
(main.tf — ARGOCD APPLICATIONS block — the last block in the file)
With ArgoCD installed, the final step is registering the applications it should manage. Each kubernetes_manifest block below creates an ArgoCD Application CRD — a declarative instruction telling ArgoCD which Git path to watch, which cluster namespace to deploy into, and how to handle sync and drift.
Ollama:
resource "kubernetes_manifest" "argocd_app_ollama" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata = {
      name      = "ollama"
      namespace = kubernetes_namespace.argocd.metadata[0].name
      labels = {
        managed-by = "terraform"
      }
    }
    spec = {
      project = "default"
      source = {
        repoURL        = var.git_repo_url
        targetRevision = var.git_branch
        path           = "apps/ollama"
        helm = {
          valueFiles = ["values.yaml"]
        }
      }
      destination = {
        server    = "https://kubernetes.default.svc"
        namespace = kubernetes_namespace.ollama.metadata[0].name
      }
      syncPolicy = {
        automated = {
          prune    = true
          selfHeal = true
        }
        syncOptions = [
          "CreateNamespace=false"
        ]
      }
    }
  }

  depends_on = [
    helm_release.argocd,
    kubernetes_namespace.ollama
  ]
}
Breaking Down:
spec.source:
source = {
  repoURL        = "https://github.com/<user>/k8s-apps.git"
  targetRevision = "main"
  path           = "apps/ollama"
  helm = {
    valueFiles = ["values.yaml"]
  }
}
repoURL: Git repository to watch
targetRevision: Branch, tag, or specific SHA
path: Directory within the repo
helm.valueFiles: Array of values files (merged in order)
spec.destination:
destination = {
  server    = "https://kubernetes.default.svc"
  namespace = "ollama"
}
server: API server of the target cluster
- kubernetes.default.svc: the local cluster (where ArgoCD runs)
- External URL: deploy to a remote cluster
namespace: Destination namespace in the cluster
spec.syncPolicy:
syncPolicy = {
  automated = {
    prune    = true
    selfHeal = true
  }
  syncOptions = [
    "CreateNamespace=false"
  ]
}
automated:
- Present: ArgoCD syncs automatically upon detecting changes
- Absent: manual sync only

prune: true:
- Resources deleted from Git are deleted from the cluster
- With false: orphaned resources remain in the cluster

selfHeal: true:
- Manual changes in the cluster are reverted
- ArgoCD forces state = Git
- With false: manual changes persist (drift)

syncOptions:
- CreateNamespace=false: do not create the namespace (it already exists via Terraform)
- CreateNamespace=true: create it if it doesn't exist
- Validate=false: skip resource validation
- PruneLast=true: delete orphaned resources last
LibreChat:
resource "kubernetes_manifest" "argocd_app_librechat" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata = {
      name      = "librechat"
      namespace = kubernetes_namespace.argocd.metadata[0].name
      labels = {
        managed-by = "terraform"
      }
    }
    spec = {
      project = "default"
      source = {
        repoURL        = var.git_repo_url
        targetRevision = var.git_branch
        path           = "apps/librechat"
        helm = {
          valueFiles = ["values.yaml"]
        }
      }
      destination = {
        server    = "https://kubernetes.default.svc"
        namespace = kubernetes_namespace.librechat.metadata[0].name
      }
      syncPolicy = {
        automated = {
          prune    = true
          selfHeal = true
        }
        syncOptions = [
          "CreateNamespace=false"
        ]
      }
    }
  }

  depends_on = [
    helm_release.argocd,
    kubernetes_namespace.librechat,
    kubernetes_secret.librechat_credentials
  ]
}
The LibreChat Application CRD follows the same structure as Ollama. The only meaningful difference is that it carries an additional depends_on reference to kubernetes_secret.librechat_credentials, ensuring the credentials exist in the cluster before ArgoCD attempts its first sync.
Outputs
(main.tf — OUTPUTS block)
output "argocd_url" {
  value = "http://argocd.glukas.space"
}

output "argocd_admin_password" {
  value     = try(data.kubernetes_secret.argocd_initial_admin.data["password"], "")
  sensitive = true
}

output "argocd_password_command" {
  value = "minikube kubectl -- -n argocd get secret argocd-initial-admin-secret -o jsonpath='{.data.password}' | base64 -d"
}

output "applications_managed" {
  value = {
    ollama = {
      namespace = kubernetes_namespace.ollama.metadata[0].name
      status    = "Managed by ArgoCD"
    }
    librechat = {
      namespace = kubernetes_namespace.librechat.metadata[0].name
      status    = "Managed by ArgoCD"
    }
  }
}
try() function:
try(expression, fallback)
Attempts to evaluate expression and returns fallback if the evaluation produces an error. This avoids a failed apply when the secret does not yet exist.
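The argocd_admin_password output references a data source that main.tf must also declare. A minimal sketch, with the resource address matching the output shown above, would be:

```hcl
# Reads ArgoCD's auto-generated admin secret once the Helm release exists.
# Sketch only; not shown in the excerpts above.
data "kubernetes_secret" "argocd_initial_admin" {
  metadata {
    name      = "argocd-initial-admin-secret"
    namespace = kubernetes_namespace.argocd.metadata[0].name
  }

  depends_on = [helm_release.argocd]
}
```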
With the outputs block in place, main.tf is complete. The full file now captures the entire platform bootstrap in roughly 200 lines: provider config, namespaces, one credentials secret, one Helm release, two ArgoCD Application CRDs, and four outputs. Running terraform apply once produces a live ArgoCD instance watching your Git repository — everything after that point is GitOps.
Implementing Application Charts
The Git repository that ArgoCD watches is separate from the Terraform code. It contains only the Helm wrapper charts — one directory per application — and nothing else. This clean separation means developers working on application configuration never need to touch Terraform, and the platform team can evolve the infrastructure layer independently.
Structure
├── apps
│ ├── librechat
│ │ ├── Chart.yaml
│ │ └── values.yaml
│ └── ollama
│ ├── Chart.yaml
│ └── values.yaml
├── main.tf
├── values
│ └── argocd-values.yaml
└── variables.tf
ArgoCD Configuration (values/argocd-values.yaml)
global:
  domain: argocd.glukas.space

server:
  service:
    type: ClusterIP
  extraArgs:
    - --insecure # HTTP only, no TLS (development only)
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - argocd.glukas.space
    paths:
      - /
    pathType: Prefix
    tls: []
  config:
    timeout.reconciliation: 180s # Polling interval

controller:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

repoServer:
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
    limits:
      cpu: 200m
      memory: 256Mi

applicationSet:
  enabled: true

notifications:
  enabled: false

dex:
  enabled: false
Critical parameters:
timeout.reconciliation: 180s
- Git polling frequency
- Trade-off: Lower value = faster detection, higher load
- Recommendation: 180s (3 min) for most cases
extraArgs: [--insecure]
- DEVELOPMENT ONLY
- Production must use TLS with valid certificates
resources.limits
- Controller: Component responsible for reconciliation
- RepoServer: Component that reads Git
- Values for ~10 applications; scale as needed
Ollama Configuration
# Chart.yaml
apiVersion: v2
name: ollama
description: Ollama deployment managed by ArgoCD
type: application
version: 1.0.0

dependencies:
  - name: ollama
    version: "1.42.0"
    repository: "https://otwld.github.io/ollama-helm/"
Fields:
apiVersion: v2: Helm 3 API version
name: wrapper chart name
type: application: deployable chart (vs. library)
version: wrapper version (local control)
dependencies:
- name: ollama: the dependency name; Helm automatically creates .Values.ollama
- version: "1.42.0": pinned version of the upstream chart. CRITICAL: always pin the version; avoid version: "*" or omitting it
- repository: chart source
Ollama: values.yaml — Analysing its Double Hierarchy
ollama: # Layer 1: dependency namespace (created automatically by Helm)
  ollama: # Layer 2: chart's internal namespace
    gpu:
      enabled: true
      type: nvidia
      number: 1
    models:
      pull:
        - llama3.2:3b
        - deepseek-r1:14b
    ingress:
      enabled: true
      className: nginx
      annotations:
        nginx.ingress.kubernetes.io/proxy-body-size: "0"
        nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
        nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
      hosts:
        - host: ollama.glukas.space
          paths:
            - path: /
              pathType: Prefix
      tls: []
  service:
    type: ClusterIP
    port: 11434
  resources:
    requests:
      memory: "2Gi"
      cpu: "500m"
    limits:
      memory: "4Gi"
      cpu: "2000m"
Why Double Hierarchy?
This is one of the most common silent failures when working with Helm wrapper charts. The configuration appears to apply correctly — ArgoCD reports Synced, no errors surface — but the behaviour you expected (GPU enabled, specific models pulled) simply does not materialise. Understanding the mechanism once prevents hours of debugging later.
Helm automatically creates a value namespace when processing a dependency:
dependencies:
  - name: ollama
As a result, .Values.ollama is automatically created.
Upstream chart has an internal namespace:
helm show values ollama --repo https://otwld.github.io/ollama-helm/
Output:
ollama: # Chart internal namespace
  gpu:
    enabled: false
Consequence:
Wrapper creates: .Values.ollama
Chart expects: .Values.ollama.xxx
Result: .Values.ollama.ollama.xxx
Technical rules:
- Every dependency creates a namespace: declaring a dependency named X makes Helm always create .Values.X
- Some charts have an internal namespace: if the first key of the chart's default values is the chart name, there is an internal namespace
- Combination = duplication:
| Layer | Origin | Path |
|---|---|---|
| 1 | Wrapper | .Values.ollama |
| 2 | Chart | .Values.ollama |
| Final | - | .Values.ollama.ollama |
Validation:
# Verify internal namespace
helm show values <repo>/<chart>
# Render locally
helm template test apps/ollama/
# Search for specific configuration
helm template test apps/ollama/ | grep -A5 "nvidia.com/gpu"
Common mistake:
# ❌ INCORRECT (only one layer)
ollama:
  gpu:
    enabled: true
Result:
- Helm looks for: .Values.ollama.ollama.gpu.enabled
- Finds: .Values.ollama.gpu.enabled
- Uses the default: gpu.enabled: false
- Symptom: no error, but the GPU is not enabled
Solution:
# ✅ CORRECT (double layer)
ollama:
  ollama:
    gpu:
      enabled: true
Specific Configurations
GPU:
ollama:
  ollama:
    gpu:
      enabled: true
      type: nvidia # Alternative: amd (ROCm)
      number: 1
Generates:
resources:
  limits:
    nvidia.com/gpu: "1"
nodeSelector:
  nvidia.com/gpu: "true"
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
Models:
models:
  pull:
    - llama3.2:3b
    - deepseek-r1:14b
Chart creates an init container:
initContainers:
  - name: pull-models
    command:
      - /bin/sh
      - -c
      - |
        ollama pull llama3.2:3b
        ollama pull deepseek-r1:14b
Ingress annotations:
annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "0"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
  nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
proxy-body-size: "0": No upload limit (models are large)
proxy-read-timeout: "600": 10min timeout (long inference)
proxy-send-timeout: "600": 10min timeout
Service and Resources (root layer):
Note: service and resources sit in the first ollama layer, not the second. The upstream chart does not expose these fields within its internal namespace, so they must be set at the dependency root level — not nested inside the chart's own namespace.
Full structure:
ollama: # dependency namespace
  ollama: # chart's internal namespace
    gpu: ...
    models: ...
    ingress: ...
  service: ... # root level
  resources: ... # root level
LibreChat Configuration
Chart.yaml
apiVersion: v2
name: librechat
description: LibreChat deployment managed by ArgoCD
type: application
version: 1.0.0

dependencies:
  - name: librechat
    version: "1.9.7"
    repository: "oci://ghcr.io/danny-avila/librechat-chart"
OCI Repository:
repository: "oci://ghcr.io/danny-avila/librechat-chart"
Syntax: oci://<registry>/<owner>/<chart>
Differences vs. HTTP repository:
- Uses the same container registry infrastructure
- Faster pull
- Better integrated versioning
values.yaml
librechat: # Layer 1: dependency namespace
  librechat: # Layer 2: chart's internal namespace (double hierarchy)
    configEnv:
      APP_TITLE: "LibreChat + Ollama (via Terraform)"
      HOST: "0.0.0.0"
      PORT: "3080"
      SEARCH: "true"
      MONGO_URI: "mongodb://librechat-mongodb:27017/LibreChat"
      MEILI_HOST: "http://librechat-meilisearch:7700"
      ALLOW_EMAIL_LOGIN: "true"
      ALLOW_REGISTRATION: "true"
      ALLOW_SOCIAL_LOGIN: "false"
      ALLOW_SOCIAL_REGISTRATION: "false"
    configYamlContent: |
      version: 1.1.5
      cache: true
      endpoints:
        custom:
          - name: "Ollama"
            apiKey: "ollama"
            baseURL: "http://ollama.ollama.svc.cluster.local:11434/v1"
            models:
              default:
                - "llama2:latest"
              fetch: true
            titleConvo: true
            titleModel: "current_model"
            modelDisplayLabel: "Ollama"
  ingress:
    enabled: true
    className: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "25m"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
    hosts:
      - host: librechat.glukas.space
        paths:
          - path: /
            pathType: Prefix
    tls: []
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "1Gi"
      cpu: "500m"
  persistence:
    enabled: true
    size: 5Gi
    storageClass: "standard"
  replicaCount: 1
  mongodb:
    enabled: true
    image:
      registry: docker.io
      repository: bitnami/mongodb
      tag: "latest"
      pullPolicy: IfNotPresent
    auth:
      enabled: false
    persistence:
      enabled: true
      size: 8Gi
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "1Gi"
        cpu: "500m"
  meilisearch:
    enabled: true
    auth:
      enabled: false
    environment:
      MEILI_NO_ANALYTICS: "true"
      MEILI_ENV: "development"
    persistence:
      enabled: true
      size: 1Gi
    resources:
      requests:
        memory: "128Mi"
        cpu: "50m"
      limits:
        memory: "512Mi"
        cpu: "250m"
Breaking down:
configEnv:
Environment variables converted to:
env:
  - name: APP_TITLE
    value: "LibreChat + Ollama (via Terraform)"
configYamlContent:
Multi-line YAML (pipe |) written into a ConfigMap and mounted as a file.
Helm processes:
configYamlContent: | ...
Creates:
apiVersion: v1
kind: ConfigMap
metadata:
  name: librechat-config
data:
  librechat.yaml: |
    version: 1.1.5
    cache: true
    ...
Mounts in:
volumeMounts:
  - name: config
    mountPath: /app/librechat.yaml
    subPath: librechat.yaml
Sub-charts (mongodb, meilisearch):
Located in the first librechat layer, not the second.
Structure:
librechat: # dependency namespace
  librechat: # chart's internal namespace
    configEnv: ...
    configYamlContent: ...
  mongodb: ... # sub-chart (root level)
  meilisearch: ... # sub-chart (root level)
  ingress: ... # root level
  resources: ... # root level
.gitignore
# Helm
charts/
Chart.lock
# Secrets
*-secrets.yaml
*.secret.yaml
# Backups
*.bak
*.tmp
# IDE
.vscode/
.idea/
*.swp
Prevents accidental commits of:
- Downloaded charts (regeneratable)
- Plaintext secrets
- Temporary files
Deployment
Prerequisites
# Minikube cluster
minikube start \
--driver docker \
--container-runtime docker \
--gpus all \
--memory 8192 \
--cpus 4
minikube addons enable ingress
# Local DNS
echo "$(minikube ip) ollama.glukas.space" | sudo tee -a /etc/hosts
echo "$(minikube ip) librechat.glukas.space" | sudo tee -a /etc/hosts
echo "$(minikube ip) argocd.glukas.space" | sudo tee -a /etc/hosts
# Git repository created and populated
git clone https://github.com/usuario/k8s-apps.git
cd k8s-apps
# Copy Chart.yaml and values.yaml to apps/ollama/ and apps/librechat/
git add .
git commit -m "Initial commit"
git push origin main
Terraform: Configuration
# terraform.tfvars
cat > terraform.tfvars <<EOF
git_repo_url = "https://github.com/usuario/k8s-apps.git"
git_branch = "main"
jwt_secret = "$(openssl rand -hex 32)"
jwt_refresh_secret = "$(openssl rand -hex 32)"
creds_key = "$(openssl rand -hex 32)"
creds_iv = "$(openssl rand -hex 16)"
EOF
# .gitignore
echo "terraform.tfvars" >> .gitignore
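A quick sanity check confirms the ignore rule actually protects the secrets file. The repository and variable value below are illustrative:

```shell
# Verify terraform.tfvars can never be committed (illustrative repo)
set -e
cd "$(mktemp -d)" && git init -q .
echo "terraform.tfvars" >> .gitignore
echo 'git_branch = "main"' > terraform.tfvars
git check-ignore terraform.tfvars   # prints the path: rule matched, exit 0
```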
Terraform: Init
cd terraform/
terraform init
Output:
Initializing provider plugins...
- Installing hashicorp/kubernetes v2.23.0...
- Installing hashicorp/helm v2.11.0...
Terraform has been successfully initialized!
Terraform: Plan
terraform plan -out=tfplan
Planned resources:
Plan: 7 to add, 0 to change, 0 to destroy.
Resources:
+ kubernetes_namespace.argocd
+ kubernetes_namespace.ollama
+ kubernetes_namespace.librechat
+ kubernetes_secret.librechat_credentials
+ helm_release.argocd
+ kubernetes_manifest.argocd_app_ollama
+ kubernetes_manifest.argocd_app_librechat
Terraform: Apply
terraform apply tfplan
Timeline:
[00:00-00:02] Namespaces created
[00:02-00:03] Secret created
[00:03-01:06] ArgoCD installed (Helm chart deployment)
[01:06-01:07] ArgoCD Applications registered (CRDs)
Apply complete! Resources: 7 added, 0 changed, 0 destroyed.
Note: Terraform only creates the platform. Apps will be deployed by ArgoCD.
ArgoCD: Initial Access
# Get password
ARGOCD_PASSWORD=$(minikube kubectl -- -n argocd get secret argocd-initial-admin-secret -o jsonpath='{.data.password}' | base64 -d)
echo "URL: http://argocd.glukas.space"
echo "User: admin"
echo "Pass: $ARGOCD_PASSWORD"
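The `base64 -d` step in the pipeline above is needed because Kubernetes stores secret values base64-encoded. A minimal Python sketch of that decode step (the password here is a made-up example, not a real secret):

```python
import base64

def decode_k8s_secret(encoded: str) -> str:
    """Decode a base64-encoded Kubernetes secret value, as `base64 -d` does."""
    return base64.b64decode(encoded).decode("utf-8")

# Hypothetical value, shaped like the argocd-initial-admin-secret payload
encoded_password = base64.b64encode(b"s3cr3t-example").decode("ascii")
print(decode_k8s_secret(encoded_password))  # → s3cr3t-example
```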
Access the UI with these credentials. Initial state:
Applications:
ollama Status: Syncing...
librechat Status: OutOfSync
ArgoCD: Sync Process
ArgoCD runs automatically:
1. Clone the repository:
git clone https://github.com/usuario/k8s-apps.git
git checkout main
2. Change detection:
Current SHA: abc123def456...
Last synced: (none - first sync)
Action: Sync required
3. Helm processing:
# For apps/ollama/
helm dependency build apps/ollama/
helm template ollama apps/ollama/ --values apps/ollama/values.yaml
# Generates YAML manifests
4. Apply to the cluster:
kubectl apply -f <generated manifests>
5. Health checking:
Waiting:
- Pods: Ready
- Deployments: Available
- StatefulSets: Ready
Observable timeline:
# Terminal 2: Monitor Ollama
watch kubectl get pods -n ollama
# Output evolves as ArgoCD syncs:
NAME READY STATUS
ollama-xxx-yyy 0/1 Pending
ollama-xxx-yyy 0/1 ContainerCreating
ollama-xxx-yyy 0/1 Running # Init container: pulling models
ollama-xxx-yyy 1/1 Running # Ready (~2-3 min)
# Terminal 3: Monitor LibreChat
watch kubectl get pods -n librechat
# Output evolves:
NAME READY STATUS
librechat-mongodb-0 0/1 Pending
librechat-meilisearch-xxx 0/1 ContainerCreating
librechat-xxx-yyy 0/1 Pending
librechat-mongodb-0 1/1 Running # ~30s
librechat-meilisearch-xxx 1/1 Running # ~25s
librechat-xxx-yyy 1/1 Running # ~1 min
After 3–5 minutes, the ArgoCD UI shows both applications as Synced and Healthy.
Verification
# Ollama
curl http://ollama.glukas.space/api/tags
{
"models": [
{"name": "llama3.2:3b", ...},
{"name": "deepseek-r1:14b", ...}
]
}
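This manual check can be turned into an assertion. The sketch below operates on an already-parsed response (the HTTP call itself is left out, since it needs a live cluster) and reports which expected models are missing from an `/api/tags` payload:

```python
def missing_models(tags_response: dict, expected: list[str]) -> list[str]:
    """Return the expected models absent from an Ollama /api/tags payload."""
    present = {m["name"] for m in tags_response.get("models", [])}
    return [name for name in expected if name not in present]

# Example payload shaped like the curl output above
response = {"models": [{"name": "llama3.2:3b"}, {"name": "deepseek-r1:14b"}]}
print(missing_models(response, ["llama3.2:3b", "deepseek-r1:14b", "mistral:latest"]))
# → ['mistral:latest']
```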
Operations
These five workflows cover the full operational lifecycle under GitOps: making a configuration change, upgrading a chart version, rolling back a broken deploy, understanding self-healing behaviour, and managing multiple environments. In every case the pattern is the same — edit files, push to Git, let ArgoCD do the rest. No kubectl apply, no terraform apply, no manual intervention required.
Workflow 1: Modify Configuration (Add a Model)
Objective: Add the mistral:latest model to Ollama.
Process:
# 1. Clone and branch
git clone https://github.com/usuario/k8s-apps.git
cd k8s-apps
git checkout -b add-mistral
# 2. Edit
vim apps/ollama/values.yaml
# Modify:
models:
pull:
- llama3.2:3b
- deepseek-r1:14b
- mistral:latest # Added
# 3. Commit
git add apps/ollama/values.yaml
git commit -m "feat(ollama): Add mistral model"
# 4. Push
git push origin add-mistral
Pull Request:
- Create PR on GitHub/GitLab
- Visible diff:
models:
pull:
- llama3.2:3b
- deepseek-r1:14b
+ - mistral:latest
- Review and approval
- Merge to main
ArgoCD takes over automatically:
Timeline after merge:
[T+0 min] Merge to main
[T+0-3 min] ArgoCD polling (waiting for next cycle)
[T+3 min] ArgoCD detects new SHA
[T+3 min] Calculates diff: + models.pull: mistral:latest
[T+3 min] Helm upgrade ollama...
[T+4 min] Rolling update initiated
[T+4-7 min] Init container: ollama pull mistral:latest
[T+7 min] New pod Ready
[T+7 min] Old pod Terminated
[T+7 min] ArgoCD status: Synced ✓
Total time: ~7 minutes from merge to deploy.
Note: At no point did the developer execute any command against the cluster directly — the entire deployment was driven by a Git push.
Workflow 2: Chart Version Upgrade
Objective: Upgrade LibreChat from 1.9.7 to 1.10.0.
git checkout -b upgrade-librechat
vim apps/librechat/Chart.yaml
# Modify:
dependencies:
- name: librechat
version: "1.10.0" # Was 1.9.7
git commit -am "chore(librechat): Upgrade to v1.10.0"
git push origin upgrade-librechat
CI/CD (optional):
# .github/workflows/helm-lint.yml
name: Helm Lint
on: pull_request
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: azure/setup-helm@v3
- name: Lint
run: |
helm lint apps/*/
- name: Template Test
run: |
helm template test apps/*/ > /dev/null
Pipeline runs:
- Helm lint (syntax validation)
- Template rendering (detects errors)
After approval and merge, ArgoCD deploys automatically.
Workflow 3: Rollback
Scenario: New deploy caused a problem in production.
Option 1: Git Revert
# View commits
git log --oneline apps/librechat/
# def456 chore(librechat): Upgrade to v1.10.0
# abc123 feat(ollama): Add mistral
# Revert
git revert def456
git push origin main
ArgoCD detects and applies the revert automatically.
Timeline: ~3–5 minutes.
Option 2: ArgoCD UI
1. Open http://argocd.glukas.space
2. Select the "librechat" application
3. Open the "History" tab
4. Sync list:
Sync 5: def456 (current) ❌
Sync 4: abc123 ✅
5. Click on Sync 4
6. Click "Rollback"
7. Confirm
Timeline: ~30 seconds.
Important: Rollback via UI is temporary. The next poll will re-sync with Git. For permanence, perform a git revert.
Option 3: ArgoCD CLI
# Install CLI
brew install argocd # or your platform's equivalent
# Login
argocd login argocd.glukas.space --username admin
# View history
argocd app history librechat
# Rollback
argocd app rollback librechat <REVISION>
Workflow 4: Self-Healing
Scenario: Manual change in the cluster.
# Someone runs:
kubectl scale deployment ollama --replicas=3 -n ollama
ArgoCD response:
[T+0s] kubectl scale executed
[T+0s] Deployment: replicas=3
[T+0-180s] ArgoCD polling interval
[T+180s] ArgoCD detects drift:
Git: replicas=1
Cluster: replicas=3
[T+181s] Self-heal triggered
kubectl apply -f deployment.yaml (from Git)
[T+182s] Kubernetes: replicas=1
3 extra pods terminated
[T+183s] ArgoCD status: Synced ✓
Event: "Self-healed: ollama deployment"
Manual change was automatically reverted.
Responsible configuration:
syncPolicy = {
automated = {
selfHeal = true # This parameter enables automatic revert
}
}
Disable self-heal:
syncPolicy = {
automated = {
prune = true
selfHeal = false # Allows manual changes to persist
}
}
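The decision this flag controls can be sketched as a pure function: given the replica count declared in Git and the one observed in the cluster, decide what the sync loop does. This is an illustration of the reconciliation logic, not ArgoCD's actual implementation:

```python
def reconcile(git_replicas: int, cluster_replicas: int, self_heal: bool) -> str:
    """Decide what the sync loop does when it compares Git with the cluster."""
    if git_replicas == cluster_replicas:
        return "synced"          # no drift detected
    if self_heal:
        return "revert-to-git"   # re-apply Git state, e.g. scale back to git_replicas
    return "out-of-sync"         # report drift, but let the manual change persist

print(reconcile(1, 3, self_heal=True))   # → revert-to-git
print(reconcile(1, 3, self_heal=False))  # → out-of-sync
```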
Workflow 5: Multi-Environment
Structure:
k8s-apps/
├── apps/
│ └── ollama/
│ ├── Chart.yaml
│ ├── values-dev.yaml
│ ├── values-staging.yaml
│ └── values-prod.yaml
Differentiated values:
# values-dev.yaml
ollama:
ollama:
models:
pull:
- llama3.2:3b # Lightweight model only
resources:
limits:
memory: "2Gi"
# values-prod.yaml
ollama:
ollama:
models:
pull:
- llama3.2:3b
- deepseek-r1:14b
- mistral:latest
resources:
limits:
memory: "8Gi"
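Helm layers `valueFiles` on top of the chart's defaults, with later sources winning key-by-key. A simplified deep merge (ignoring Helm's null-deletion and list-replacement rules; the default values here are hypothetical) illustrates how the per-environment overrides compose:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; nested dicts merge, scalars replace."""
    out = dict(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], val)
        else:
            out[key] = val
    return out

# Hypothetical chart defaults plus the dev override from values-dev.yaml
defaults = {"ollama": {"ollama": {"gpu": {"enabled": True},
                                  "resources": {"limits": {"memory": "4Gi"}}}}}
dev = {"ollama": {"ollama": {"resources": {"limits": {"memory": "2Gi"}}}}}
merged = deep_merge(defaults, dev)
print(merged["ollama"]["ollama"]["resources"]["limits"]["memory"])  # → 2Gi
print(merged["ollama"]["ollama"]["gpu"]["enabled"])                 # → True
```

Only the overridden key changes; sibling settings like the GPU flag survive the merge.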
ArgoCD Applications (Terraform):
# Dev
resource "kubernetes_manifest" "argocd_app_ollama_dev" {
manifest = {
spec = {
source = {
repoURL = var.git_repo_url
targetRevision = "develop" # develop branch
path = "apps/ollama"
helm = {
valueFiles = ["values-dev.yaml"]
}
}
destination = {
namespace = "ollama-dev"
}
}
}
}
# Prod
resource "kubernetes_manifest" "argocd_app_ollama_prod" {
manifest = {
spec = {
source = {
repoURL = var.git_repo_url
targetRevision = "main" # main branch
path = "apps/ollama"
helm = {
valueFiles = ["values-prod.yaml"]
}
}
destination = {
namespace = "ollama-prod"
}
}
}
}
Promotion flow:
Feature branch → develop (PR) → auto-deploy to Dev
→ staging (PR) → auto-deploy to Staging
→ main (PR + approvals) → auto-deploy to Prod
Git branches map to environments.
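That mapping is ultimately a lookup table. A sketch (the staging namespace name is an assumption; only dev and prod appear in the Terraform above):

```python
# Hypothetical branch-to-environment mapping mirroring the Application resources
ENVIRONMENTS = {
    "develop": {"namespace": "ollama-dev",     "values": "values-dev.yaml"},
    "staging": {"namespace": "ollama-staging", "values": "values-staging.yaml"},
    "main":    {"namespace": "ollama-prod",    "values": "values-prod.yaml"},
}

def target_for(branch: str) -> dict:
    """Resolve which namespace and values file a pushed branch deploys to."""
    if branch not in ENVIRONMENTS:
        raise ValueError(f"branch {branch!r} is not mapped to an environment")
    return ENVIRONMENTS[branch]

print(target_for("main"))
# → {'namespace': 'ollama-prod', 'values': 'values-prod.yaml'}
```

Feature branches deliberately resolve to nothing: they only deploy once merged into a mapped branch.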
Troubleshooting
This section covers the most common failure modes when running ArgoCD in practice — what they look like, why they happen, and how to fix them.
Problem 1: Application OutOfSync
An OutOfSync status means ArgoCD has detected a difference between what's in Git and what's running in the cluster, but hasn't been able to resolve it. This is usually the first sign that something went wrong during a sync — not necessarily a cluster problem, but worth investigating immediately.
Symptom:
kubectl get application -n argocd
NAME SYNC STATUS HEALTH STATUS
ollama OutOfSync Unknown
Diagnosis:
# Describe Application
kubectl describe application ollama -n argocd
# View events
kubectl get events -n argocd --sort-by='.lastTimestamp'
# repo-server logs
kubectl logs -n argocd deployment/argocd-repo-server
# application-controller logs
kubectl logs -n argocd statefulset/argocd-application-controller
Common causes:
- YAML syntax error
Error: YAML parse error line 15: mapping values are not allowed here
Solution: Fix syntax in values.yaml
- Chart version not found
Error: chart "ollama" version "1.42.0" not found
Solution: Check available versions:
helm search repo ollama --versions
- Repository unreachable
Error: failed to fetch https://github.com/usuario/k8s-apps.git: authentication required
Solution: Configure credentials in ArgoCD
Local validation:
# Test template rendering
cd k8s-apps/
helm dependency build apps/ollama/
helm template test apps/ollama/
# If there's an error, it will appear here
Problem 2: Pods CrashLoopBackOff
A CrashLoopBackOff means the pod is starting, failing, and being restarted repeatedly. ArgoCD may show the application as Synced — meaning the deployment was applied correctly — but Degraded on health, because the pod never reaches a running state. The problem is almost always in the container itself, not in ArgoCD.
Symptom:
kubectl get pods -n ollama
NAME READY STATUS RESTARTS
ollama-xxx 0/1 CrashLoopBackOff 5
Diagnosis:
# Current pod logs
kubectl logs -n ollama ollama-xxx
# Previous container logs (if it has restarted)
kubectl logs -n ollama ollama-xxx --previous
# Pod events and conditions
kubectl describe pod -n ollama ollama-xxx
Common causes:
- GPU not available
Error: failed to initialize NVML: could not load NVML library
Solution:
# Temporarily disable GPU
ollama:
ollama:
gpu:
enabled: false
- Insufficient memory
Error: OOMKilled
Solution:
resources:
limits:
memory: "8Gi" # Increase
- Model does not exist
Error: pulling model: model 'llama4' not found
Solution: Check the model name in values.yaml
Problem 3: Double Hierarchy Not Applied
This is one of the trickier failure modes because ArgoCD reports everything as healthy — the sync succeeded, no errors are visible, but the configuration simply isn't taking effect. It typically happens when the Helm values file is missing one level of nesting, causing the GPU settings to silently fall back to defaults.
Symptom:
- ArgoCD shows Synced
- GPU not enabled
- No visible errors
Diagnosis:
# Render full template
helm template test apps/ollama/
# Search for GPU configuration
helm template test apps/ollama/ | grep -A10 "nvidia.com/gpu"
# If not found, structure is wrong
Cause:
# ❌ Incorrect structure (one layer)
ollama:
gpu:
enabled: true
Solution:
# ✅ Correct structure (double layer)
ollama:
ollama:
gpu:
enabled: true
Validation:
# After correction, check diff in ArgoCD UI
# Should show change in spec.template.spec.containers[].resources
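The silent fallback happens because nested value lookups default at every missing level instead of erroring. A sketch of that lookup behaviour (illustrative, not the chart's actual template code):

```python
def gpu_enabled(values: dict) -> bool:
    """Mimic how a nested Helm value lookup falls back to a chart default."""
    return (values.get("ollama", {})       # wrapper chart key
                  .get("ollama", {})       # subchart key -- the easily-missed level
                  .get("gpu", {})
                  .get("enabled", False))  # chart default

wrong = {"ollama": {"gpu": {"enabled": True}}}              # one layer: ignored
right = {"ollama": {"ollama": {"gpu": {"enabled": True}}}}  # double layer
print(gpu_enabled(wrong))  # → False  (falls back to the default, no error raised)
print(gpu_enabled(right))  # → True
```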
Problem 4: Slow Sync
Unlike the previous problems, slow sync isn't a failure — it's expected behavior that becomes surprising when you first encounter it. ArgoCD doesn't watch Git in real time; it polls on a fixed interval, so there will always be a delay between a git push and a deployment.
Symptom:
ArgoCD takes >5 minutes to detect changes.
Cause:
Default polling interval is 3 minutes.
Solution 1: Adjust polling
# values/argocd-values.yaml
server:
config:
timeout.reconciliation: 60s # 1 minute
Trade-off: More load on the cluster and Git repo.
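The trade-off can be quantified: average detection latency is half the polling interval, and each application fetches the repo 86400 / interval times per day. A quick estimate:

```python
def polling_cost(interval_s: int, apps: int = 1) -> dict:
    """Estimate detection latency and daily Git fetch load for a polling interval."""
    return {
        "avg_latency_s": interval_s / 2,
        "worst_latency_s": interval_s,
        "fetches_per_day": apps * 86400 // interval_s,
    }

print(polling_cost(180))          # default 3-minute interval
# → {'avg_latency_s': 90.0, 'worst_latency_s': 180, 'fetches_per_day': 480}
print(polling_cost(60, apps=10))  # 1-minute interval across 10 applications
# → {'avg_latency_s': 30.0, 'worst_latency_s': 60, 'fetches_per_day': 14400}
```

Halving the interval doubles the fetch load, which is why webhooks (Solution 2) scale better than aggressive polling.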
Solution 2: Webhook
Configure a webhook in Git to notify ArgoCD:
# GitHub webhook URL
POST https://argocd.glukas.space/api/webhook
ArgoCD syncs immediately upon receiving a push.
Solution 3: Manual sync
# Via CLI
argocd app sync ollama
# Via UI
Click "Sync" on the application
Ch. 3 vs Ch. 4: When to Use Each
Both approaches are valid, and the right choice depends on team size, deploy frequency, and how much operational overhead you want to absorb upfront. The table below maps the key trade-offs to help you decide:
| | Ch. 3 (Terraform + Helm) | Ch. 4 (Terraform + ArgoCD) |
|---|---|---|
| Deploy trigger | Manual: terraform apply | Automatic: Git push |
| Latency | Immediate | 3 min (polling) |
| Reconciliation | Manual: terraform plan | Continuous: 3-min loop |
| Drift detection | Manual | Automatic |
| Self-healing | Does not exist | Configurable (selfHeal) |
| Rollback | git revert + terraform apply | ArgoCD UI (1 click) or git revert |
| Audit trail | Git + Terraform logs | Git + ArgoCD events |
| Multi-env | Duplicate code or workspaces | Branches + valueFiles |
| Permissions required | kubectl + Terraform | Git only |
| Disaster recovery | Re-run Terraform | Automatic ArgoCD re-sync |
| State management | Terraform state (central) | Git (distributed) |
| Initial complexity | Medium | High |
| Scalability (apps) | ~20 apps | Unlimited |
| Ideal team size | 1–10 | 10+ |
Chapter 3's approach is simpler to set up and sufficient for small teams with controlled deploy cadences — if a weekly terraform apply is acceptable, the added complexity of ArgoCD is not justified. Chapter 4 becomes the right choice once teams grow, deploy frequency increases, or compliance requirements demand an immutable audit trail and automated drift correction. The two are not mutually exclusive: many organisations start with Chapter 3 and migrate to Chapter 4 as their operational maturity grows.
Conclusion
Chapters 1 through 4 trace a deliberate progression — from manual kubectl commands to a fully automated, self-healing platform. Each chapter addressed a specific limitation of the one before it: verbosity, the need for manual execution, the absence of continuous reconciliation. The cumulative result is an architecture where Git is the single source of truth, and the cluster enforces that truth on its own.
The four GitOps principles are not just theoretical framing — each one translates directly into an operational guarantee. Declarative configuration means the desired state is always readable and auditable without touching the cluster. Version control means every change has an author, a rationale, and a rollback path. Pull-based deployment means no external system ever needs credentials to reach the cluster — the cluster reaches out to Git. Continuous reconciliation means drift is detected and corrected automatically, without anyone noticing or reacting.
The architecture also enforces a clean separation of concerns that makes each layer independently replaceable:
Terraform → Platform bootstrap (namespaces, secrets, ArgoCD)
Git → Application desired state
ArgoCD → Reconciliation engine
Helm → Packaging and templating
Changes to one layer do not cascade into the others. You could swap Helm for raw manifests, or replace Terraform with a different provisioner, without touching ArgoCD or the Git repository structure.
This foundation is deliberately extensible. The next steps — security, observability, multi-tenancy — build on top of it without requiring the core architecture to change.
Maturity Journey
Each chapter in this series represents a deliberate step up the maturity ladder — not just in tooling, but in ownership model, speed, and scale:
Stage 1: Manual Deployment (Ch. 1)
Maturity: Ad-hoc
Ownership: Individuals
Speed: Slow (days/weeks)
Scale: Doesn't scale
Stage 2: Infrastructure as Code (Ch. 2–3)
Maturity: Repeatable
Ownership: Ops team
Speed: Medium (hours/days)
Scale: Limited (manual execution)
Stage 3: GitOps Foundation (Ch. 4) ← We are here
Maturity: Automated
Ownership: Shared (platform + dev)
Speed: Fast (minutes/hours)
Scale: Good (self-service enabled)
Stage 4: Infrastructure as Product (Next Steps)
Maturity: Product-driven
Ownership: Platform teams (product owners)
Speed: Very fast (minutes)
Scale: Excellent (true self-service)
Metrics: DORA, satisfaction, adoption
What Comes Next
Stage 3 is a foundation, not a destination. The architecture built in this chapter is intentionally minimal — one team, two applications, one cluster — and that is the right place to start. But the same GitOps primitives that make this setup work at small scale are exactly what allow it to grow.
The diagram below shows the current state: a single developer workflow, a flat namespace structure, and ArgoCD managing two specific workloads with no shared services, no multi-tenancy, and no separation between platform concerns and application concerns.
The target looks substantially different. The cluster is split into two distinct layers: a Platform Layer of shared services — security, observability, secrets management — owned by a dedicated platform team with SLAs and roadmaps; and a Workload Layer where individual product teams deploy independently via git push, without ever touching the platform layer beneath them.
The gap between the two diagrams is not a rewrite — it is an incremental build. Every component in the Platform Layer gets added as an ArgoCD-managed application in its own namespace, following the exact same wrapper-chart pattern introduced in this chapter. The core architecture does not change; it simply gains more managed services over time.
The next chapters will build out this platform layer starting with the highest-impact additions: security, observability, and secrets management.
| Initiative | Domain | Phase | Complexity | Impact | Dependencies | Time | Priority |
|---|---|---|---|---|---|---|---|
| Pomerium | SECURITY | Foundation | Intermediate | High | ArgoCD | 3-5d | P0 |
| Sealed Secrets | SECURITY | Foundation | Basic | High | None | 1d | P0 |
| Authentik | SECURITY | Foundation | Intermediate | High | PostgreSQL | 3-5d | P0 |
| Prometheus + Grafana | OBSERVABILITY | Foundation | Intermediate | High | None | 3-5d | P0 |
| MCP Servers | INTEGRATION | Foundation | Intermediate | High | None | 2-3d | P0 |
| RAG (Qdrant) | AI/LLM | Foundation | Advanced | High | None | 1w | P1 |
| LangSmith/Langfuse | AI/LLM | Scale | Advanced | High | Prometheus | 5-7d | P1 |
| Autoscaling | INFRA | Scale | Intermediate | High | Prometheus | 2-3d | P1 |
| Loki | OBSERVABILITY | Scale | Basic | Medium | Grafana | 1-2d | P1 |
| SearXNG | INTEGRATION | Scale | Basic | Medium | None | 1d | P1 |
| Web Scraper | INTEGRATION | Scale | Intermediate | Medium | None | 2d | P1 |
| Tilt | DEVEX | Scale | Basic | Medium | None | 1d | P1 |
| Jaeger | OBSERVABILITY | Production Excellence | Advanced | Medium | Prometheus | 3-5d | P2 |
| Model Registry | AI/LLM | Production Excellence | Intermediate | Medium | None | 3-5d | P2 |
| Multi-region | NETWORK | Production Excellence | Expert | Medium | ArgoCD | 2w+ | P3 |
| Fine-tuning | AI/LLM | Production Excellence | Expert | Low | Registry | 2w | P3 |
Platform Products (Shared Services):
Pomerium + Authentik:
product: "Authentication & Authorization Platform"
customers: "All applications"
value: "SSO, MFA, zero-trust"
sla: "99.9% uptime, <200ms auth latency"
roadmap: ["RBAC granular", "SAML support", "API keys"]
Prometheus + Grafana + Loki:
product: "Observability Platform"
customers: "All teams (dev + ops)"
value: "Unified metrics/logs/traces"
sla: "30d retention, <5s query time"
roadmap: ["AIOps", "Cost attribution", "SLO management"]
Sealed Secrets:
product: "Secrets Management Platform"
customers: "All teams"
value: "Git-native secrets, rotation, audit"
sla: "Zero exposure, <1min sync"
roadmap: ["Vault integration", "RBAC", "Expiration"]
Workload-Specific Products:
RAG (Qdrant):
product: "Vector Search Service"
customers: "AI/ML teams"
value: "Semantic search, embeddings"
sla: "<100ms p95 search latency"
roadmap: ["Multi-model", "Hybrid search"]
MCP Servers:
product: "Tool Integration Platform"
customers: "LLM applications"
value: "Connect LLMs to tools"
sla: "<50ms tool invocation"
roadmap: ["Custom tools", "Async execution"]
Developer Experience Products:
Tilt:
domain: "[DEVEX]"
phase: "Scale"
complexity: "Basic"
impact: "Medium"
dependencies: ["None"]
time: "1 day"
priority: "P1"
product: "Local Development Platform"
customers: "All developers"
value: "Hot-reload, real K8s environment, fast iteration"
sla: "<5s code sync, <10s service restart"
roadmap: ["Remote development", "Debugging tools", "Resource snapshots"]
Infrastructure Products:
Multi-region:
domain: "[NETWORK]"
phase: "Production Excellence"
complexity: "Expert"
impact: "Medium"
dependencies: ["ArgoCD"]
time: "2+ weeks"
priority: "P3"
product: "Global Load Balancing & Geo-distribution"
customers: "All production workloads"
value: "Low latency worldwide, compliance (data residency)"
sla: "99.99% global availability, <100ms cross-region failover"
roadmap: ["Active-active", "Traffic shaping", "Cost optimization", "DR automation"]
Production Recommended Extensions
- TLS/HTTPS:
# argocd-values.yaml
server:
ingress:
tls:
- secretName: argocd-tls
hosts:
- argocd.empresa.com
- SSO/OIDC:
server:
config:
url: https://argocd.empresa.com
oidc.config: |
name: Okta
issuer: https://empresa.okta.com
clientID: $oidc.okta.clientId
clientSecret: $oidc.okta.clientSecret
- RBAC:
server:
rbacConfig:
policy.csv: |
p, role:developers, applications, get, */*, allow
p, role:developers, applications, sync, */*, allow
g, developers-group, role:developers
- Notifications:
notifications:
enabled: true
notifiers:
service.slack: |
token: $slack-token
templates:
template.app-deployed: |
message: Application {{.app.metadata.name}} deployed
triggers:
trigger.on-deployed: |
- when: app.status.operationState.phase in ['Succeeded']
send: [app-deployed]
- Application Sets:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: cluster-apps
spec:
generators:
- git:
repoURL: https://github.com/empresa/k8s-apps.git
revision: HEAD
directories:
- path: apps/*
template:
metadata:
name: '{{path.basename}}'
spec:
source:
repoURL: https://github.com/empresa/k8s-apps.git
path: '{{path}}'
destination:
server: https://kubernetes.default.svc
namespace: '{{path.basename}}'
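The git directory generator's behaviour can be sketched: for every directory matched by `apps/*`, it stamps the Application template with the path and its basename. An illustrative rendering (not the real ApplicationSet controller):

```python
def render_applications(directories: list[str]) -> list[dict]:
    """Expand an ApplicationSet-style template for each matched app directory."""
    apps = []
    for path in directories:
        basename = path.rstrip("/").split("/")[-1]  # {{path.basename}}
        apps.append({
            "name": basename,       # Application name
            "source_path": path,    # {{path}}
            "namespace": basename,  # destination namespace
        })
    return apps

print(render_applications(["apps/ollama", "apps/librechat"]))
# → [{'name': 'ollama', 'source_path': 'apps/ollama', 'namespace': 'ollama'},
#    {'name': 'librechat', 'source_path': 'apps/librechat', 'namespace': 'librechat'}]
```

Adding a new app then means adding a directory under `apps/` and pushing; no new Application resource is written by hand.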
Monitoring:
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: argocd-metrics
spec:
selector:
matchLabels:
app.kubernetes.io/name: argocd-metrics
endpoints:
- port: metrics
Technical Resources
Auxiliary Tools:
- argocd CLI
- kubectl-argo-rollouts (progressive delivery)
- argocd-notifications (alerts)
- argocd-image-updater (auto-update images)
End of Chapter 4