Each chapter in this series has been a deliberate step toward removing friction from LLM infrastructure management. Chapter 2 introduced Infrastructure as Code by mapping Kubernetes resources directly to Terraform — functional, but painfully verbose. Chapter 3 resolved the abstraction problem by bringing in the Helm provider, collapsing 500 lines of HCL into 50 and allowing Terraform to reason about applications rather than individual resources. Both approaches, however, shared the same fundamental constraint: every change still required a human to run terraform apply. The cluster had no awareness of Git, drift went undetected until someone noticed, and scaling that model to larger teams or more frequent deploys would inevitably make manual execution a bottleneck. Chapter 4 closes that loop by introducing GitOps with ArgoCD — making the cluster itself responsible for continuously reconciling its state against Git, without anyone needing to trigger a command.
The Four Principles of GitOps
GitOps is an operational paradigm built on four interconnected principles:
1. Declarative
Everything the cluster needs to run is expressed as YAML files. There is no procedural "run this command" — only a description of the desired end state.
2. Versioned
All state lives in Git. Every change has an author, a timestamp, and a diff. Rollback becomes as simple as reverting a commit.
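The revert-to-rollback path can be sketched end to end in a throwaway repository. Everything here is illustrative (paths, values, and commit messages are not from the chapter's actual repo):

```shell
# Rollback sketch: revert a bad config commit; ArgoCD would converge on the
# reverted state at its next poll. Paths and values are illustrative.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email demo@example.com && git config user.name demo

mkdir -p apps/ollama
echo "gpu: true" > apps/ollama/values.yaml
git add -A && git commit -qm "good config"

echo "gpu: false" > apps/ollama/values.yaml
git add -A && git commit -qm "bad config"

git revert --no-edit HEAD      # undo the bad commit with a new commit
cat apps/ollama/values.yaml    # back to: gpu: true
```

The key point: the rollback itself is just another commit, so it carries the same author, timestamp, and diff as any other change.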
3. Pull-based (key differentiator)
Chapter 3: Developer → terraform apply → push → Cluster
Chapter 4: Developer → git push → Git ← poll ← ArgoCD (in the cluster)
4. Continuous reconciliation
ArgoCD runs an infinite loop:
while True:
    git_state = fetch_from_git()
    cluster_state = fetch_from_kubernetes()
    if git_state != cluster_state:
        apply_changes()
    sleep(180)  # 3 minutes
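A single iteration of this loop can be mimicked in plain shell with mocked state fetchers. The function bodies below are illustrative stand-ins, not ArgoCD internals:

```shell
# One reconcile iteration with mocked state sources (illustrative only)
fetch_from_git()        { echo "sha-abc123"; }   # desired state (Git HEAD)
fetch_from_kubernetes() { echo "sha-000000"; }   # live state (cluster)
apply_changes()         { echo "syncing cluster to Git state"; }

git_state=$(fetch_from_git)
cluster_state=$(fetch_from_kubernetes)
if [ "$git_state" != "$cluster_state" ]; then
  apply_changes   # runs only when desired and live state diverge
fi
```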
Context and Rationale
With the mechanics of GitOps established, it is worth stepping back to understand why this shift matters beyond a single team running git push. Removing terraform apply from the workflow is only the surface benefit — the deeper payoff is that GitOps creates the operational foundation for treating infrastructure as a product, with real ownership, roadmaps, and SLAs.
Growing Cloud Complexity
Cloud infrastructure promised simplicity. In 2010, AWS offered ~20 intuitive services, provisioned with clicks in the UI. The value proposition was clear: simplify infrastructure.
Today, reality is different:
AWS (2010):
├── ~20 services
├── Provisioning via UI
└── Simplicity as core value
AWS (2025):
├── 175+ services
├── Provisioning via IaC
└── Complexity as new reality
Organizational impact:
According to Max Griffiths (Thoughtworks):
"Rising cloud complexity is putting many organizations and their infrastructure teams right back to where they were 15 years ago — struggling to keep up with demand for new services and instances, and stay on top of an increasingly unmanageable infrastructure footprint."
Consequences:
- Time to market increased (instead of decreasing)
- Required skills grew exponentially
- Self-service became impractical
- Full-stack developer scope expanded to the point of becoming a disadvantage
Infrastructure as Product
Infrastructure as Product treats infrastructure not as a centralized service provider, but as a portfolio of internal products that enable product teams to deliver value quickly.
Paradigm shift:
Traditional (project-based):
Developer → Ticket → Ops Team → Wait → Provision → Deploy
Timeline: Days/weeks
Bottleneck: Ops team on critical path
Infrastructure as Product:
Developer → Self-service Platform → Provision → Deploy
Timeline: Minutes/hours
Enabler: Platform removes friction
Three Core Principles
1. Developer as Customer
Infrastructure should be designed around developer experience first. If the platform is hard to use, it won't be adopted — regardless of how technically sound it is.
Success metrics:
- Developer satisfaction score
- Time to provision (target: <30 min)
- Platform adoption rate
- Support tickets reduction
Anti-pattern:
# Requires learning a new language just to provision
developer:
  must_learn: [HCL, Pulumi, CloudFormation]
  to_do: "Same work as before"
  result: "Poor developer experience"
Correct pattern:
# Familiar interface, correct abstraction
developer:
  uses: [YAML, familiar APIs]
  self_service: true
  result: "High adoption, low friction"
2. Platform Teams With a Product Mindset
According to Sebastian Straube (Accenture), infrastructure teams shouldn't operate as a shared service that reacts to tickets — they should be restructured into dedicated platform product teams, each owning a slice of the internal platform with a roadmap, SLAs, and real accountability to their users.
Organization example:
Platform Team: Compute & Container
├── Product: Kubernetes-as-a-Service
├── Customers: All development teams
├── Roadmap:
│ ├── Q1: Auto-scaling GPU nodes
│ ├── Q2: Service mesh integration
│ └── Q3: Cost optimization
├── SLA: 99.9% uptime, <5min provision
└── Metrics: Adoption rate, developer satisfaction
Platform Team: Observability
├── Product: Unified logs/metrics/traces
├── Customers: All teams (dev + ops)
├── Roadmap:
│ ├── Q1: 30-day retention
│ ├── Q2: AIOps integration
│ └── Q3: Cost attribution
├── SLA: <5s query response
└── Metrics: Query volume, alert accuracy
Characteristics:
- Long-lived teams (not project-based)
- Own product roadmap and backlog
- Accountable for adoption and satisfaction
- Measure success by customer outcomes
3. Self-Service as a Core Value
The platform must enable true self-service — meaning the platform team is never on the critical path of a deployment.
Anti-pattern:
Developer → Submit PR → Platform review → Approve → Deploy
Problem: Platform team on critical path (bottleneck)
Correct pattern:
Developer → Use platform API → Automated validation → Deploy
Platform: Monitor outcomes
Real-world example:
# Developer workflow
kubectl apply -f app.yaml
# Platform validates automatically:
# ✓ Security policies (OPA)
# ✓ Resource quotas
# ✓ Network policies
# ✓ Deploy without manual approval
GitOps as the Technical Foundation
GitOps is not just "automation" — it is the enabling layer that makes Infrastructure as Product operationally viable. Each of its four properties maps directly to a platform capability:
1. Declarative = Product Catalog
The Git repository becomes a catalog of available products that any team can consume, customize, and deploy independently.
k8s-apps/
├── apps/ollama/ # Product: LLM Inference
├── apps/librechat/ # Product: Chat Interface
└── apps/postgres/ # Product: Database
Developer "consumes" a product via:
git clone k8s-apps
cp -r apps/ollama apps/my-llm
vim apps/my-llm/values.yaml # Customize
git commit && git push
# ArgoCD auto-deploy
2. Self-Service = Developer Autonomy
ArgoCD removes the platform team from the critical path:
Without ArgoCD:
Developer → Code → Request deploy → Platform team → Manual deploy
With ArgoCD:
Developer → Code → Git push → ArgoCD auto-sync → Deployed
3. API-Driven = Programmatic Access
ArgoCD Application CRDs are the deployment API:
# Developer creates workload via Kubernetes API
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: my-service
spec:
source:
repoURL: https://github.com/company/k8s-apps
path: apps/my-service
destination:
server: https://kubernetes.default.svc
namespace: my-namespace
syncPolicy:
automated:
prune: true
selfHeal: true
4. Standardization With Flexibility
Helm charts serve as reusable product templates — the platform defines the structure, teams supply the configuration.
The platform provides templates:
├── web-app/ # Template for web apps
├── ml-service/ # Template for ML workloads
└── data-pipeline/ # Template for ETL
The developer customizes:
cp -r templates/web-app apps/my-app
vim apps/my-app/values.yaml # Configuration only
git push # Automatic deployment
Architectural Decision
App of Apps with Independent Wrappers
Rather than managing all services as a single unit, this architecture treats each service as an independent package within the App of Apps pattern. GitOps manages deployments, versions, and updates at the individual service level — not the stack level.
Each service (Ollama, LibreChat) has its own Helm chart wrapper that:
- References the upstream chart via dependencies in Chart.yaml
- Customizes configuration via a local values.yaml
- Maintains independent versioning
- Allows isolated evolution
Wrapper Structure
apps/
├── librechat/
│ ├── Chart.yaml # Wrapper chart with dependency
│ └── values.yaml # Specific configuration
└── ollama/
├── Chart.yaml # Wrapper chart with dependency
└── values.yaml # Specific configuration
Chart.yaml defines the dependency:
dependencies:
  - name: ollama
    version: "1.42.0" # Fixed version
    repository: "https://otwld.github.io/ollama-helm/"
values.yaml customizes the upstream chart:
ollama: # Wrapper namespace
  ollama: # Chart namespace (double hierarchy)
    gpu:
      enabled: true
    models:
      pull:
        - llama3.2:3b
Benefits of This Approach
1. Service Independence
- Each app evolves at its own pace
- Deploying Ollama does not affect LibreChat
- Reduces the risk of cross-service regressions
2. Granular Versioning
apps/ollama/Chart.yaml: version: "1.42.0"
apps/librechat/Chart.yaml: version: "1.9.7"
- Upstream chart versions pinned individually
- Upgrades tested and applied per service
- Independent rollback per application
3. Isolated Customization
- Values specific per service
- No configuration conflicts
- Individual testability
4. Per-Service Observability
- ArgoCD Application per service
- Isolated logs and events
- Specific health checking
5. Automated Deployment
Git commit → ArgoCD detects → Helm processes wrapper → Deploy to cluster
- ArgoCD manages each wrapper as an Application
- Automatic sync per service
- Independent self-healing
6. Complete Tracking
git log apps/ollama/values.yaml
# Complete change history for Ollama
git log apps/librechat/values.yaml
# Complete change history for LibreChat
- Audit trail per service
- Specific PRs per change
- Granular rollback via Git
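That granular rollback can be sketched with git checkout on a single path: restore only Ollama's values from the previous commit while LibreChat keeps its latest config. The repository contents below are illustrative:

```shell
# Per-service rollback sketch (illustrative repo and values)
set -e
cd "$(mktemp -d)" && git init -q .
git config user.email demo@example.com && git config user.name demo
mkdir -p apps/ollama apps/librechat
echo "model: llama3.2:3b" > apps/ollama/values.yaml
echo "title: v1"          > apps/librechat/values.yaml
git add -A && git commit -qm "v1"
echo "model: deepseek-r1:14b" > apps/ollama/values.yaml
echo "title: v2"              > apps/librechat/values.yaml
git add -A && git commit -qm "v2"

# Restore a single path from the previous revision; only the ollama
# Application sees a diff on the next ArgoCD sync
git checkout HEAD~1 -- apps/ollama/values.yaml
git commit -qm "rollback ollama only"
cat apps/ollama/values.yaml      # model: llama3.2:3b
cat apps/librechat/values.yaml   # title: v2
```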
Types of Wrappers
Wrapper with Public Chart (HTTP):
# apps/ollama/Chart.yaml
dependencies:
  - name: ollama
    version: "1.42.0"
    repository: "https://otwld.github.io/ollama-helm/"
Wrapper with OCI Chart:
# apps/librechat/Chart.yaml
dependencies:
  - name: librechat
    version: "1.9.7"
    repository: "oci://ghcr.io/danny-avila/librechat-chart"
Wrapper with Sub-charts:
# apps/custom-app/Chart.yaml
dependencies:
  - name: app
    version: "1.0.0"
    repository: "https://..."
  - name: postgres
    version: "12.0.0"
    repository: "https://charts.bitnami.com/bitnami"
  - name: redis
    version: "17.0.0"
    repository: "https://charts.bitnami.com/bitnami"
Governance and Standardization
No tight coupling:
- Each service maintains flexibility
- Standards enforced via linting (CI/CD)
- Common templates available, not mandatory
Example of suggested standard:
# All services follow the label convention
metadata:
  labels:
    app.kubernetes.io/name: {{ .Chart.Name }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/version: {{ .Chart.AppVersion }}
    app.kubernetes.io/managed-by: argocd
But each service can add specific labels without breaking others.
Alternatives Considered and Rejected
❌ Single monorepo without wrappers:
# values.yaml (monolithic)
ollama:
  gpu: ...
librechat:
  config: ...
postgres:
  ...
Problems:
- A change in Ollama affects LibreChat's versioning
- Atomic all-or-nothing deployment
- Rollback impacts all services
❌ Single custom chart:
# custom-chart/
templates/
  ollama.yaml
  librechat.yaml
  postgres.yaml
Problems:
- Reinventing the wheel (official charts already exist)
- Heavy maintenance burden
- Complicates upstream upgrades
✅ Independent wrappers (chosen):
- Reuses upstream charts
- Independence between services
- Easy maintenance
- Flexible governance
Trade-offs
Advantages:
- Full independence between services
- Granular versioning
- Individual rollback
- Isolated observability
- Scalability (adding an app = new directory)
Disadvantages:
- Duplication of common configurations (e.g., ingress annotations)
- Requires linting to enforce standards
- More files to manage
Considerations
Use when:
- Medium/large team (5+ people)
- Multiple independent services
- Frequent deploys per service
- Need for granular rollback
- Distributed governance (teams own services)
Do not use when:
- Monolithic application (single service)
- Very small team (1–2 people)
- All services always deployed together
- Preference for extreme simplicity
Implementing Terraform
This section implements main.tf. It does everything Terraform is responsible for in this architecture — which is intentionally narrow. Rather than managing applications directly, main.tf bootstraps the platform once and then hands control to ArgoCD. Concretely, it does four things in sequence: creates the three namespaces, provisions the credentials secret that LibreChat needs at runtime, installs ArgoCD via Helm, and registers the two ArgoCD Application CRDs that point at the Git repository. After that first terraform apply, Terraform is largely out of the picture — all subsequent application changes flow through Git.
The file is organised into four logical blocks, each separated by a comment header:
main.tf
├── Providers — Kubernetes + Helm provider config
├── Infrastructure — Namespaces + credentials secret
├── ArgoCD — Helm installation + values reference
└── ArgoCD Applications — CRDs that register Ollama and LibreChat
The subsections below walk through each block in order.
Providers
(main.tf lines 1–27 — terraform {} block + provider declarations)
terraform {
  required_version = ">= 1.0"

  required_providers {
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.11"
    }
  }
}

provider "kubernetes" {
  config_path    = "~/.kube/config"
  config_context = "minikube"
}

provider "helm" {
  kubernetes {
    config_path    = "~/.kube/config"
    config_context = "minikube"
  }
}
The provider configuration is identical to Chapter 3: both the kubernetes and helm providers read from ~/.kube/config and target the minikube context. No changes needed here.
Namespaces
(main.tf — INFRASTRUCTURE block)
resource "kubernetes_namespace" "argocd" {
  metadata {
    name = "argocd"
    labels = {
      managed-by = "terraform"
      purpose    = "gitops"
    }
  }
}

resource "kubernetes_namespace" "ollama" {
  metadata {
    name = "ollama"
    labels = {
      managed-by = "terraform"
    }
  }
}

resource "kubernetes_namespace" "librechat" {
  metadata {
    name = "librechat"
    labels = {
      managed-by = "terraform"
    }
  }
}
Design decision: Namespaces are managed by Terraform rather than ArgoCD because they are prerequisites for everything else — ArgoCD itself cannot deploy into a namespace that does not yet exist. They also change rarely enough that the overhead of GitOps reconciliation is not justified. The purpose = "gitops" label distinguishes tooling namespaces from application namespaces at a glance.
Infrastructure Secrets
(main.tf — still inside the INFRASTRUCTURE block, immediately after the namespaces)
resource "kubernetes_secret" "librechat_credentials" {
  metadata {
    name      = "librechat-credentials-env"
    namespace = kubernetes_namespace.librechat.metadata[0].name
  }

  data = {
    JWT_SECRET         = var.jwt_secret
    JWT_REFRESH_SECRET = var.jwt_refresh_secret
    CREDS_KEY          = var.creds_key
    CREDS_IV           = var.creds_iv
    MONGO_URI          = "mongodb://librechat-mongodb:27017/LibreChat"
    MEILI_HOST         = "http://librechat-meilisearch:7700"
    OLLAMA_BASE_URL    = "http://ollama.ollama.svc.cluster.local:11434"
  }

  type = "Opaque"
}
Design decision: Credentials are managed by Terraform precisely because they must not appear in Git. Committing plaintext secrets to a repository — even a private one — violates the principle of least exposure. For production environments, the right path is to replace this with Sealed Secrets, External Secrets Operator, or HashiCorp Vault, all of which allow secrets to live in Git in an encrypted or referenced form. Chapter 5 covers this.
ArgoCD Installation
(main.tf — ARGOCD block)
resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  namespace  = kubernetes_namespace.argocd.metadata[0].name
  version    = "5.51.6"

  values = [
    file("${path.module}/values/argocd-values.yaml")
  ]

  timeout       = 600
  wait          = true
  wait_for_jobs = true

  depends_on = [
    kubernetes_namespace.argocd
  ]
}
ArgoCD Applications (CRDs)
(main.tf — ARGOCD APPLICATIONS block — the last block in the file)
With ArgoCD installed, the final step is registering the applications it should manage. Each kubernetes_manifest block below creates an ArgoCD Application CRD — a declarative instruction telling ArgoCD which Git path to watch, which cluster namespace to deploy into, and how to handle sync and drift.
Ollama:
resource "kubernetes_manifest" "argocd_app_ollama" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata = {
      name      = "ollama"
      namespace = kubernetes_namespace.argocd.metadata[0].name
      labels = {
        managed-by = "terraform"
      }
    }
    spec = {
      project = "default"
      source = {
        repoURL        = var.git_repo_url
        targetRevision = var.git_branch
        path           = "apps/ollama"
        helm = {
          valueFiles = ["values.yaml"]
        }
      }
      destination = {
        server    = "https://kubernetes.default.svc"
        namespace = kubernetes_namespace.ollama.metadata[0].name
      }
      syncPolicy = {
        automated = {
          prune    = true
          selfHeal = true
        }
        syncOptions = [
          "CreateNamespace=false"
        ]
      }
    }
  }

  depends_on = [
    helm_release.argocd,
    kubernetes_namespace.ollama
  ]
}
Breaking Down:
spec.source:
source = {
  repoURL        = "https://github.com/<user>/k8s-apps.git"
  targetRevision = "main"
  path           = "apps/ollama"
  helm = {
    valueFiles = ["values.yaml"]
  }
}
repoURL: Git repository to watch
targetRevision: Branch, tag, or specific SHA
path: Directory within the repo
helm.valueFiles: Array of values files (merged in order)
spec.destination:
destination = {
  server    = "https://kubernetes.default.svc"
  namespace = "ollama"
}
server: API server of the target cluster
- kubernetes.default.svc: the local cluster (where ArgoCD runs)
- External URL: deploy to a remote cluster
namespace: Destination namespace in the cluster
spec.syncPolicy:
syncPolicy = {
  automated = {
    prune    = true
    selfHeal = true
  }
  syncOptions = [
    "CreateNamespace=false"
  ]
}
automated:
- Present: ArgoCD syncs automatically upon detecting changes
- Absent: manual sync only

prune: true:
- Resources deleted from Git are deleted from the cluster
- With false: orphaned resources remain in the cluster

selfHeal: true:
- Manual changes in the cluster are reverted
- ArgoCD forces state = Git
- With false: manual changes persist (drift)

syncOptions:
- CreateNamespace=false: do not create the namespace (it already exists via Terraform)
- CreateNamespace=true: create it if it doesn't exist
- Validate=false: skip resource validation
- PruneLast=true: delete orphaned resources last
LibreChat:
resource "kubernetes_manifest" "argocd_app_librechat" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata = {
      name      = "librechat"
      namespace = kubernetes_namespace.argocd.metadata[0].name
      labels = {
        managed-by = "terraform"
      }
    }
    spec = {
      project = "default"
      source = {
        repoURL        = var.git_repo_url
        targetRevision = var.git_branch
        path           = "apps/librechat"
        helm = {
          valueFiles = ["values.yaml"]
        }
      }
      destination = {
        server    = "https://kubernetes.default.svc"
        namespace = kubernetes_namespace.librechat.metadata[0].name
      }
      syncPolicy = {
        automated = {
          prune    = true
          selfHeal = true
        }
        syncOptions = [
          "CreateNamespace=false"
        ]
      }
    }
  }

  depends_on = [
    helm_release.argocd,
    kubernetes_namespace.librechat,
    kubernetes_secret.librechat_credentials
  ]
}
The LibreChat Application CRD follows the same structure as Ollama. The only meaningful difference is that it carries an additional depends_on reference to kubernetes_secret.librechat_credentials, ensuring the credentials exist in the cluster before ArgoCD attempts its first sync.
Outputs
(main.tf — OUTPUTS block)
output "argocd_url" {
  value = "http://argocd.glukas.space"
}

output "argocd_admin_password" {
  value     = try(data.kubernetes_secret.argocd_initial_admin.data["password"], "")
  sensitive = true
}

output "argocd_password_command" {
  value = "minikube kubectl -- -n argocd get secret argocd-initial-admin-secret -o jsonpath='{.data.password}' | base64 -d"
}

output "applications_managed" {
  value = {
    ollama = {
      namespace = kubernetes_namespace.ollama.metadata[0].name
      status    = "Managed by ArgoCD"
    }
    librechat = {
      namespace = kubernetes_namespace.librechat.metadata[0].name
      status    = "Managed by ArgoCD"
    }
  }
}
try() function:
try(expression, fallback)
Attempts to evaluate expression and returns fallback if the evaluation produces an error. This avoids a failed apply when the secret does not yet exist.
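The argocd_admin_password output references a data source that main.tf must also declare. A minimal sketch, with the resource address matching the output shown above, would be:

```hcl
# Reads ArgoCD's auto-generated admin secret once the Helm release exists.
# Sketch only; not shown in the excerpts above.
data "kubernetes_secret" "argocd_initial_admin" {
  metadata {
    name      = "argocd-initial-admin-secret"
    namespace = kubernetes_namespace.argocd.metadata[0].name
  }

  depends_on = [helm_release.argocd]
}
```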
With the outputs block in place, main.tf is complete. The full file now captures the entire platform bootstrap in roughly 200 lines: provider config, namespaces, one credentials secret, one Helm release, two ArgoCD Application CRDs, and four outputs. Running terraform apply once produces a live ArgoCD instance watching your Git repository — everything after that point is GitOps.
Implementing Application Charts
The Git repository that ArgoCD watches is separate from the Terraform code. It contains only the Helm wrapper charts — one directory per application — and nothing else. This clean separation means developers working on application configuration never need to touch Terraform, and the platform team can evolve the infrastructure layer independently.
Structure
├── apps
│ ├── librechat
│ │ ├── Chart.yaml
│ │ └── values.yaml
│ └── ollama
│ ├── Chart.yaml
│ └── values.yaml
├── main.tf
├── values
│ └── argocd-values.yaml
└── variables.tf
ArgoCD Configuration (values/argocd-values.yaml)
global:
  domain: argocd.glukas.space

server:
  service:
    type: ClusterIP
  extraArgs:
    - --insecure # HTTP only, no TLS (development only)
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - argocd.glukas.space
    paths:
      - /
    pathType: Prefix
    tls: []
  config:
    timeout.reconciliation: 180s # Polling interval

controller:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

repoServer:
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
    limits:
      cpu: 200m
      memory: 256Mi

applicationSet:
  enabled: true

notifications:
  enabled: false

dex:
  enabled: false
Critical parameters:
timeout.reconciliation: 180s
- Git polling frequency
- Trade-off: Lower value = faster detection, higher load
- Recommendation: 180s (3 min) for most cases
extraArgs: [--insecure]
- DEVELOPMENT ONLY
- Production must use TLS with valid certificates
resources.limits
- Controller: Component responsible for reconciliation
- RepoServer: Component that reads Git
- Values for ~10 applications; scale as needed
Ollama Configuration
# Chart.yaml
apiVersion: v2
name: ollama
description: Ollama deployment managed by ArgoCD
type: application
version: 1.0.0

dependencies:
  - name: ollama
    version: "1.42.0"
    repository: "https://otwld.github.io/ollama-helm/"
Fields:
apiVersion: v2: Helm 3 API version
name: wrapper chart name
type: application: deployable chart (vs. library)
version: wrapper version (local control)
dependencies:
- name: ollama: the dependency name; Helm automatically creates .Values.ollama
- version: "1.42.0": pinned version of the upstream chart. CRITICAL: always pin the version; avoid version: "*" or omitting it
- repository: chart source
Ollama: values.yaml — Analysing its Double Hierarchy
ollama: # Layer 1: dependency namespace (created automatically by Helm)
  ollama: # Layer 2: chart's internal namespace
    gpu:
      enabled: true
      type: nvidia
      number: 1
    models:
      pull:
        - llama3.2:3b
        - deepseek-r1:14b
    ingress:
      enabled: true
      className: nginx
      annotations:
        nginx.ingress.kubernetes.io/proxy-body-size: "0"
        nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
        nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
      hosts:
        - host: ollama.glukas.space
          paths:
            - path: /
              pathType: Prefix
      tls: []
  service:
    type: ClusterIP
    port: 11434
  resources:
    requests:
      memory: "2Gi"
      cpu: "500m"
    limits:
      memory: "4Gi"
      cpu: "2000m"
Why Double Hierarchy?
This is one of the most common silent failures when working with Helm wrapper charts. The configuration appears to apply correctly — ArgoCD reports Synced, no errors surface — but the behaviour you expected (GPU enabled, specific models pulled) simply does not materialise. Understanding the mechanism once prevents hours of debugging later.
Helm automatically creates a value namespace when processing a dependency:
dependencies:
  - name: ollama
As a result, .Values.ollama is automatically created.
Upstream chart has an internal namespace:
helm show values ollama --repo https://otwld.github.io/ollama-helm/
Output:
ollama: # Chart internal namespace
  gpu:
    enabled: false
Consequence:
Wrapper creates: .Values.ollama
Chart expects: .Values.ollama.xxx
Result: .Values.ollama.ollama.xxx
Technical rules:
- Every dependency creates a namespace: declaring a dependency named X makes Helm always create .Values.X
- Some charts have an internal namespace: if the first key of the chart's default values is the chart name, there is an internal namespace
- Combination = duplication:
| Layer | Origin | Path |
|---|---|---|
| 1 | Wrapper | .Values.ollama |
| 2 | Chart | .Values.ollama |
| Final | - | .Values.ollama.ollama |
Validation:
# Verify internal namespace
helm show values <repo>/<chart>
# Render locally
helm template test apps/ollama/
# Search for specific configuration
helm template test apps/ollama/ | grep -A5 "nvidia.com/gpu"
Common mistake:
# ❌ INCORRECT (only one layer)
ollama:
  gpu:
    enabled: true
Result:
- Helm looks for: .Values.ollama.ollama.gpu.enabled
- Finds: .Values.ollama.gpu.enabled
- Uses the default: gpu.enabled: false
- Symptom: no error, but the GPU is not enabled
Solution:
# ✅ CORRECT (double layer)
ollama:
  ollama:
    gpu:
      enabled: true
Specific Configurations
GPU:
ollama:
  ollama:
    gpu:
      enabled: true
      type: nvidia # Alternative: amd (ROCm)
      number: 1
Generates:
resources:
  limits:
    nvidia.com/gpu: "1"
nodeSelector:
  nvidia.com/gpu: "true"
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
Models:
models:
  pull:
    - llama3.2:3b
    - deepseek-r1:14b
Chart creates an init container:
initContainers:
  - name: pull-models
    command:
      - /bin/sh
      - -c
      - |
        ollama pull llama3.2:3b
        ollama pull deepseek-r1:14b
Ingress annotations:
annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "0"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
  nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
proxy-body-size: "0": No upload limit (models are large)
proxy-read-timeout: "600": 10min timeout (long inference)
proxy-send-timeout: "600": 10min timeout
Service and Resources (root layer):
Note: service and resources sit in the first ollama layer, not the second. The upstream chart does not expose these fields within its internal namespace, so they must be set at the dependency root level — not nested inside the chart's own namespace.
Full structure:
ollama: # dependency namespace
  ollama: # chart's internal namespace
    gpu: ...
    models: ...
    ingress: ...
  service: ... # root level
  resources: ... # root level
LibreChat Configuration
Chart.yaml
apiVersion: v2
name: librechat
description: LibreChat deployment managed by ArgoCD
type: application
version: 1.0.0

dependencies:
  - name: librechat
    version: "1.9.7"
    repository: "oci://ghcr.io/danny-avila/librechat-chart"
OCI Repository:
repository: "oci://ghcr.io/danny-avila/librechat-chart"
Syntax: oci://<registry>/<owner>/<chart>
Differences vs. HTTP repository:
- Uses the same container registry infrastructure
- Faster pull
- Better integrated versioning
values.yaml
librechat: # Layer 1: dependency namespace
  librechat: # Layer 2: chart's internal namespace (double hierarchy)
    configEnv:
      APP_TITLE: "LibreChat + Ollama (via Terraform)"
      HOST: "0.0.0.0"
      PORT: "3080"
      SEARCH: "true"
      MONGO_URI: "mongodb://librechat-mongodb:27017/LibreChat"
      MEILI_HOST: "http://librechat-meilisearch:7700"
      ALLOW_EMAIL_LOGIN: "true"
      ALLOW_REGISTRATION: "true"
      ALLOW_SOCIAL_LOGIN: "false"
      ALLOW_SOCIAL_REGISTRATION: "false"
    configYamlContent: |
      version: 1.1.5
      cache: true
      endpoints:
        custom:
          - name: "Ollama"
            apiKey: "ollama"
            baseURL: "http://ollama.ollama.svc.cluster.local:11434/v1"
            models:
              default:
                - "llama2:latest"
              fetch: true
            titleConvo: true
            titleModel: "current_model"
            modelDisplayLabel: "Ollama"
  ingress:
    enabled: true
    className: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "25m"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
    hosts:
      - host: librechat.glukas.space
        paths:
          - path: /
            pathType: Prefix
    tls: []
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "1Gi"
      cpu: "500m"
  persistence:
    enabled: true
    size: 5Gi
    storageClass: "standard"
  replicaCount: 1
  mongodb:
    enabled: true
    image:
      registry: docker.io
      repository: bitnami/mongodb
      tag: "latest"
      pullPolicy: IfNotPresent
    auth:
      enabled: false
    persistence:
      enabled: true
      size: 8Gi
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "1Gi"
        cpu: "500m"
  meilisearch:
    enabled: true
    auth:
      enabled: false
    environment:
      MEILI_NO_ANALYTICS: "true"
      MEILI_ENV: "development"
    persistence:
      enabled: true
      size: 1Gi
    resources:
      requests:
        memory: "128Mi"
        cpu: "50m"
      limits:
        memory: "512Mi"
        cpu: "250m"
Breaking down:
configEnv:
Environment variables converted to:
env:
  - name: APP_TITLE
    value: "LibreChat + Ollama (via Terraform)"
configYamlContent:
Multi-line YAML (pipe |) written into a ConfigMap and mounted as a file.
Helm processes:
configYamlContent: | ...
Creates:
apiVersion: v1
kind: ConfigMap
metadata:
  name: librechat-config
data:
  librechat.yaml: |
    version: 1.1.5
    cache: true
    ...
Mounts in:
volumeMounts:
  - name: config
    mountPath: /app/librechat.yaml
    subPath: librechat.yaml
Sub-charts (mongodb, meilisearch):
Located in the first librechat layer, not the second.
Structure:
librechat: # dependency namespace
  librechat: # chart's internal namespace
    configEnv: ...
    configYamlContent: ...
  mongodb: ... # sub-chart (root level)
  meilisearch: ... # sub-chart (root level)
  ingress: ... # root level
  resources: ... # root level
.gitignore
# Helm
charts/
Chart.lock
# Secrets
*-secrets.yaml
*.secret.yaml
# Backups
*.bak
*.tmp
# IDE
.vscode/
.idea/
*.swp
Prevents accidental commits of:
- Downloaded charts (regeneratable)
- Plaintext secrets
- Temporary files
Deployment
Prerequisites
# Minikube cluster
minikube start \
--driver docker \
--container-runtime docker \
--gpus all \
--memory 8192 \
--cpus 4
minikube addons enable ingress
# Local DNS
echo "$(minikube ip) ollama.glukas.space" | sudo tee -a /etc/hosts
echo "$(minikube ip) librechat.glukas.space" | sudo tee -a /etc/hosts
echo "$(minikube ip) argocd.glukas.space" | sudo tee -a /etc/hosts
# Git repository created and populated
git clone https://github.com/usuario/k8s-apps.git
cd k8s-apps
# Copy Chart.yaml and values.yaml to apps/ollama/ and apps/librechat/
git add .
git commit -m "Initial commit"
git push origin main
Terraform: Configuration
# terraform.tfvars
cat > terraform.tfvars <<EOF
git_repo_url = "https://github.com/usuario/k8s-apps.git"
git_branch = "main"
jwt_secret = "$(openssl rand -hex 32)"
jwt_refresh_secret = "$(openssl rand -hex 32)"
creds_key = "$(openssl rand -hex 32)"
creds_iv = "$(openssl rand -hex 16)"
EOF
# .gitignore
echo "terraform.tfvars" >> .gitignore
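A quick sanity check confirms the ignore rule actually protects the secrets file. The repository and variable value below are illustrative:

```shell
# Verify terraform.tfvars can never be committed (illustrative repo)
set -e
cd "$(mktemp -d)" && git init -q .
echo "terraform.tfvars" >> .gitignore
echo 'git_branch = "main"' > terraform.tfvars
git check-ignore terraform.tfvars   # prints the path: rule matched, exit 0
```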
Terraform: Init
cd terraform/
terraform init
Output:
Initializing provider plugins...
- Installing hashicorp/kubernetes v2.23.0...
- Installing hashicorp/helm v2.11.0...
Terraform has been successfully initialized!
Terraform: Plan
terraform plan -out=tfplan
Planned resources:
Plan: 7 to add, 0 to change, 0 to destroy.
Resources:
+ kubernetes_namespace.argocd
+ kubernetes_namespace.ollama
+ kubernetes_namespace.librechat
+ kubernetes_secret.librechat_credentials
+ helm_release.argocd
+ kubernetes_manifest.argocd_app_ollama
+ kubernetes_manifest.argocd_app_librechat
Terraform: Apply
terraform apply tfplan
Timeline:
[00:00-00:02] Namespaces created
[00:02-00:03] Secret created
[00:03-01:06] ArgoCD installed (Helm chart deployment)
[01:06-01:07] ArgoCD Applications registered (CRDs)
Apply complete! Resources: 7 added, 0 changed, 0 destroyed.
Note: Terraform only creates the platform. Apps will be deployed by ArgoCD.
ArgoCD: Initial Access
# Get password
ARGOCD_PASSWORD=$(minikube kubectl -- -n argocd get secret argocd-initial-admin-secret -o jsonpath='{.data.password}' | base64 -d)
echo "URL: http://argocd.glukas.space"
echo "User: admin"
echo "Pass: $ARGOCD_PASSWORD"
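The `base64 -d` step in the pipeline above is needed because Kubernetes stores secret values base64-encoded. A minimal Python sketch of that decode step (the password here is a made-up example, not a real secret):

```python
import base64

def decode_k8s_secret(encoded: str) -> str:
    """Decode a base64-encoded Kubernetes secret value, as `base64 -d` does."""
    return base64.b64decode(encoded).decode("utf-8")

# Hypothetical value, shaped like the argocd-initial-admin-secret payload
encoded_password = base64.b64encode(b"s3cr3t-example").decode("ascii")
print(decode_k8s_secret(encoded_password))  # → s3cr3t-example
```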
Access the UI with these credentials. Initial state:
Applications:
ollama Status: Syncing...
librechat Status: OutOfSync
ArgoCD: Sync Process
ArgoCD runs automatically:
1. Clone the repository:
git clone https://github.com/usuario/k8s-apps.git
git checkout main
2. Change detection:
Current SHA: abc123def456...
Last synced: (none - first sync)
Action: Sync required
3. Helm processing:
# For apps/ollama/
helm dependency build apps/ollama/
helm template ollama apps/ollama/ --values apps/ollama/values.yaml
# Generates YAML manifests
4. Apply to the cluster:
kubectl apply -f <generated manifests>
5. Health checking:
Waiting:
- Pods: Ready
- Deployments: Available
- StatefulSets: Ready
Observable timeline:
# Terminal 2: Monitor Ollama
watch kubectl get pods -n ollama
# Output evolves as ArgoCD syncs:
NAME READY STATUS
ollama-xxx-yyy 0/1 Pending
ollama-xxx-yyy 0/1 ContainerCreating
ollama-xxx-yyy 0/1 Running # Init container: pulling models
ollama-xxx-yyy 1/1 Running # Ready (~2-3 min)
# Terminal 3: Monitor LibreChat
watch kubectl get pods -n librechat
# Output evolves:
NAME READY STATUS
librechat-mongodb-0 0/1 Pending
librechat-meilisearch-xxx 0/1 ContainerCreating
librechat-xxx-yyy 0/1 Pending
librechat-mongodb-0 1/1 Running # ~30s
librechat-meilisearch-xxx 1/1 Running # ~25s
librechat-xxx-yyy 1/1 Running # ~1 min
After 3–5 minutes, the ArgoCD UI shows both applications as Synced and Healthy.
Verification
# Ollama
curl http://ollama.glukas.space/api/tags
{
"models": [
{"name": "llama3.2:3b", ...},
{"name": "deepseek-r1:14b", ...}
]
}
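This manual check can be turned into an assertion. The sketch below operates on an already-parsed response (the HTTP call itself is left out, since it needs a live cluster) and reports which expected models are missing from an `/api/tags` payload:

```python
def missing_models(tags_response: dict, expected: list[str]) -> list[str]:
    """Return the expected models absent from an Ollama /api/tags payload."""
    present = {m["name"] for m in tags_response.get("models", [])}
    return [name for name in expected if name not in present]

# Example payload shaped like the curl output above
response = {"models": [{"name": "llama3.2:3b"}, {"name": "deepseek-r1:14b"}]}
print(missing_models(response, ["llama3.2:3b", "deepseek-r1:14b", "mistral:latest"]))
# → ['mistral:latest']
```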
Operations
These five workflows cover the full operational lifecycle under GitOps: making a configuration change, upgrading a chart version, rolling back a broken deploy, understanding self-healing behaviour, and managing multiple environments. In every case the pattern is the same — edit files, push to Git, let ArgoCD do the rest. No kubectl apply, no terraform apply, no manual intervention required.
Workflow 1: Modify Configuration (Add a Model)
Objective: Add the mistral:latest model to Ollama.
Process:
# 1. Clone and branch
git clone https://github.com/usuario/k8s-apps.git
cd k8s-apps
git checkout -b add-mistral
# 2. Edit
vim apps/ollama/values.yaml
# Modify:
models:
pull:
- llama3.2:3b
- deepseek-r1:14b
- mistral:latest # Added
# 3. Commit
git add apps/ollama/values.yaml
git commit -m "feat(ollama): Add mistral model"
# 4. Push
git push origin add-mistral
Pull Request:
- Create PR on GitHub/GitLab
- Visible diff:
models:
pull:
- llama3.2:3b
- deepseek-r1:14b
+ - mistral:latest
- Review and approval
- Merge to main
ArgoCD takes over automatically:
Timeline after merge:
[T+0 min] Merge to main
[T+0-3 min] ArgoCD polling (waiting for next cycle)
[T+3 min] ArgoCD detects new SHA
[T+3 min] Calculates diff: + models.pull: mistral:latest
[T+3 min] Helm upgrade ollama...
[T+4 min] Rolling update initiated
[T+4-7 min] Init container: ollama pull mistral:latest
[T+7 min] New pod Ready
[T+7 min] Old pod Terminated
[T+7 min] ArgoCD status: Synced ✓
Total time: ~7 minutes from merge to deploy.
Note: At no point did the developer execute any command against the cluster directly — the entire deployment was driven by a Git push.
Workflow 2: Chart Version Upgrade
Objective: Upgrade LibreChat from 1.9.7 to 1.10.0.
git checkout -b upgrade-librechat
vim apps/librechat/Chart.yaml
# Modify:
dependencies:
- name: librechat
version: "1.10.0" # Was 1.9.7
git commit -am "chore(librechat): Upgrade to v1.10.0"
git push origin upgrade-librechat
CI/CD (optional):
# .github/workflows/helm-lint.yml
name: Helm Lint
on: pull_request
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: azure/setup-helm@v3
- name: Lint
run: |
helm lint apps/*/
- name: Template Test
run: |
helm template test apps/*/ > /dev/null
Pipeline runs:
- Helm lint (syntax validation)
- Template rendering (detects errors)
After approval and merge, ArgoCD deploys automatically.
Workflow 3: Rollback
Scenario: New deploy caused a problem in production.
Option 1: Git Revert
# View commits
git log --oneline apps/librechat/
# def456 chore(librechat): Upgrade to v1.10.0
# abc123 feat(ollama): Add mistral
# Revert
git revert def456
git push origin main
ArgoCD detects and applies the revert automatically.
Timeline: ~3–5 minutes.
Option 2: ArgoCD UI
1. Open http://argocd.glukas.space
2. Select the "librechat" application
3. Open the "History" tab
4. Sync list:
Sync 5: def456 (current) ❌
Sync 4: abc123 ✅
5. Click on Sync 4
6. Click "Rollback"
7. Confirm
Timeline: ~30 seconds.
Important: Rollback via UI is temporary. The next poll will re-sync with Git. For permanence, perform a git revert.
Option 3: ArgoCD CLI
# Install CLI
brew install argocd # or your platform's equivalent
# Login
argocd login argocd.glukas.space --username admin
# View history
argocd app history librechat
# Rollback
argocd app rollback librechat <REVISION>
Workflow 4: Self-Healing
Scenario: Manual change in the cluster.
# Someone runs:
kubectl scale deployment ollama --replicas=3 -n ollama
ArgoCD response:
[T+0s] kubectl scale executed
[T+0s] Deployment: replicas=3
[T+0-180s] ArgoCD polling interval
[T+180s] ArgoCD detects drift:
Git: replicas=1
Cluster: replicas=3
[T+181s] Self-heal triggered
kubectl apply -f deployment.yaml (from Git)
[T+182s] Kubernetes: replicas=1
3 extra pods terminated
[T+183s] ArgoCD status: Synced ✓
Event: "Self-healed: ollama deployment"
Manual change was automatically reverted.
Responsible configuration:
syncPolicy = {
automated = {
selfHeal = true # This parameter enables automatic revert
}
}
Disable self-heal:
syncPolicy = {
automated = {
prune = true
selfHeal = false # Allows manual changes to persist
}
}
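The decision this flag controls can be sketched as a pure function: given the replica count declared in Git and the one observed in the cluster, decide what the sync loop does. This is an illustration of the reconciliation logic, not ArgoCD's actual implementation:

```python
def reconcile(git_replicas: int, cluster_replicas: int, self_heal: bool) -> str:
    """Decide what the sync loop does when it compares Git with the cluster."""
    if git_replicas == cluster_replicas:
        return "synced"          # no drift detected
    if self_heal:
        return "revert-to-git"   # re-apply Git state, e.g. scale back to git_replicas
    return "out-of-sync"         # report drift, but let the manual change persist

print(reconcile(1, 3, self_heal=True))   # → revert-to-git
print(reconcile(1, 3, self_heal=False))  # → out-of-sync
```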
Workflow 5: Multi-Environment
Structure:
k8s-apps/
├── apps/
│ └── ollama/
│ ├── Chart.yaml
│ ├── values-dev.yaml
│ ├── values-staging.yaml
│ └── values-prod.yaml
Differentiated values:
# values-dev.yaml
ollama:
ollama:
models:
pull:
- llama3.2:3b # Lightweight model only
resources:
limits:
memory: "2Gi"
# values-prod.yaml
ollama:
ollama:
models:
pull:
- llama3.2:3b
- deepseek-r1:14b
- mistral:latest
resources:
limits:
memory: "8Gi"
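Helm layers `valueFiles` on top of the chart's defaults, with later sources winning key-by-key. A simplified deep merge (ignoring Helm's null-deletion and list-replacement rules; the default values here are hypothetical) illustrates how the per-environment overrides compose:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; nested dicts merge, scalars replace."""
    out = dict(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], val)
        else:
            out[key] = val
    return out

# Hypothetical chart defaults plus the dev override from values-dev.yaml
defaults = {"ollama": {"ollama": {"gpu": {"enabled": True},
                                  "resources": {"limits": {"memory": "4Gi"}}}}}
dev = {"ollama": {"ollama": {"resources": {"limits": {"memory": "2Gi"}}}}}
merged = deep_merge(defaults, dev)
print(merged["ollama"]["ollama"]["resources"]["limits"]["memory"])  # → 2Gi
print(merged["ollama"]["ollama"]["gpu"]["enabled"])                 # → True
```

Only the overridden key changes; sibling settings like the GPU flag survive the merge.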
ArgoCD Applications (Terraform):
# Dev
resource "kubernetes_manifest" "argocd_app_ollama_dev" {
manifest = {
spec = {
source = {
repoURL = var.git_repo_url
targetRevision = "develop" # develop branch
path = "apps/ollama"
helm = {
valueFiles = ["values-dev.yaml"]
}
}
destination = {
namespace = "ollama-dev"
}
}
}
}
# Prod
resource "kubernetes_manifest" "argocd_app_ollama_prod" {
manifest = {
spec = {
source = {
repoURL = var.git_repo_url
targetRevision = "main" # main branch
path = "apps/ollama"
helm = {
valueFiles = ["values-prod.yaml"]
}
}
destination = {
namespace = "ollama-prod"
}
}
}
}
Promotion flow:
Feature branch → develop (PR) → auto-deploy to Dev
→ staging (PR) → auto-deploy to Staging
→ main (PR + approvals) → auto-deploy to Prod
Git branches map to environments.
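That mapping is ultimately a lookup table. A sketch (the staging namespace name is an assumption; only dev and prod appear in the Terraform above):

```python
# Hypothetical branch-to-environment mapping mirroring the Application resources
ENVIRONMENTS = {
    "develop": {"namespace": "ollama-dev",     "values": "values-dev.yaml"},
    "staging": {"namespace": "ollama-staging", "values": "values-staging.yaml"},
    "main":    {"namespace": "ollama-prod",    "values": "values-prod.yaml"},
}

def target_for(branch: str) -> dict:
    """Resolve which namespace and values file a pushed branch deploys to."""
    if branch not in ENVIRONMENTS:
        raise ValueError(f"branch {branch!r} is not mapped to an environment")
    return ENVIRONMENTS[branch]

print(target_for("main"))
# → {'namespace': 'ollama-prod', 'values': 'values-prod.yaml'}
```

Feature branches deliberately resolve to nothing: they only deploy once merged into a mapped branch.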
Troubleshooting
This section covers the most common failure modes when running ArgoCD in practice — what they look like, why they happen, and how to fix them.
Problem 1: Application OutOfSync
An OutOfSync status means ArgoCD has detected a difference between what's in Git and what's running in the cluster, but hasn't been able to resolve it. This is usually the first sign that something went wrong during a sync — not necessarily a cluster problem, but worth investigating immediately.
Symptom:
kubectl get application -n argocd
NAME SYNC STATUS HEALTH STATUS
ollama OutOfSync Unknown
Diagnosis:
# Describe Application
kubectl describe application ollama -n argocd
# View events
kubectl get events -n argocd --sort-by='.lastTimestamp'
# repo-server logs
kubectl logs -n argocd deployment/argocd-repo-server
# application-controller logs
kubectl logs -n argocd statefulset/argocd-application-controller
Common causes:
- YAML syntax error
Error: YAML parse error line 15: mapping values are not allowed here
Solution: Fix syntax in values.yaml
- Chart version not found
Error: chart "ollama" version "1.42.0" not found
Solution: Check available versions:
helm search repo ollama --versions
- Repository unreachable
Error: failed to fetch https://github.com/usuario/k8s-apps.git: authentication required
Solution: Configure credentials in ArgoCD
Local validation:
# Test template rendering
cd k8s-apps/
helm dependency build apps/ollama/
helm template test apps/ollama/
# If there's an error, it will appear here
Problem 2: Pods CrashLoopBackOff
A CrashLoopBackOff means the pod is starting, failing, and being restarted repeatedly. ArgoCD may show the application as Synced — meaning the deployment was applied correctly — but Degraded on health, because the pod never reaches a running state. The problem is almost always in the container itself, not in ArgoCD.
Symptom:
kubectl get pods -n ollama
NAME READY STATUS RESTARTS
ollama-xxx 0/1 CrashLoopBackOff 5
Diagnosis:
# Current pod logs
kubectl logs -n ollama ollama-xxx
# Previous container logs (if it has restarted)
kubectl logs -n ollama ollama-xxx --previous
# Pod events and conditions
kubectl describe pod -n ollama ollama-xxx
Common causes:
- GPU not available
Error: failed to initialize NVML: could not load NVML library
Solution:
# Temporarily disable GPU
ollama:
ollama:
gpu:
enabled: false
- Insufficient memory
Error: OOMKilled
Solution:
resources:
limits:
memory: "8Gi" # Increase
- Model does not exist
Error: pulling model: model 'llama4' not found
Solution: Check the model name in values.yaml
Problem 3: Double Hierarchy Not Applied
This is one of the trickier failure modes because ArgoCD reports everything as healthy — the sync succeeded, no errors are visible, but the configuration simply isn't taking effect. It typically happens when the Helm values file is missing one level of nesting, causing the GPU settings to silently fall back to defaults.
Symptom:
- ArgoCD shows Synced
- GPU not enabled
- No visible errors
Diagnosis:
# Render full template
helm template test apps/ollama/
# Search for GPU configuration
helm template test apps/ollama/ | grep -A10 "nvidia.com/gpu"
# If not found, structure is wrong
Cause:
# ❌ Incorrect structure (one layer)
ollama:
gpu:
enabled: true
Solution:
# ✅ Correct structure (double layer)
ollama:
ollama:
gpu:
enabled: true
Validation:
# After correction, check diff in ArgoCD UI
# Should show change in spec.template.spec.containers[].resources
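The silent fallback happens because nested value lookups default at every missing level instead of erroring. A sketch of that lookup behaviour (illustrative, not the chart's actual template code):

```python
def gpu_enabled(values: dict) -> bool:
    """Mimic how a nested Helm value lookup falls back to a chart default."""
    return (values.get("ollama", {})       # wrapper chart key
                  .get("ollama", {})       # subchart key -- the easily-missed level
                  .get("gpu", {})
                  .get("enabled", False))  # chart default

wrong = {"ollama": {"gpu": {"enabled": True}}}              # one layer: ignored
right = {"ollama": {"ollama": {"gpu": {"enabled": True}}}}  # double layer
print(gpu_enabled(wrong))  # → False  (falls back to the default, no error raised)
print(gpu_enabled(right))  # → True
```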
Problem 4: Slow Sync
Unlike the previous problems, slow sync isn't a failure — it's expected behavior that becomes surprising when you first encounter it. ArgoCD doesn't watch Git in real time; it polls on a fixed interval, so there will always be a delay between a git push and a deployment.
Symptom:
ArgoCD takes >5 minutes to detect changes.
Cause:
Default polling interval is 3 minutes.
Solution 1: Adjust polling
# values/argocd-values.yaml
server:
config:
timeout.reconciliation: 60s # 1 minute
Trade-off: More load on the cluster and Git repo.
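The trade-off can be quantified: average detection latency is half the polling interval, and each application fetches the repo 86400 / interval times per day. A quick estimate:

```python
def polling_cost(interval_s: int, apps: int = 1) -> dict:
    """Estimate detection latency and daily Git fetch load for a polling interval."""
    return {
        "avg_latency_s": interval_s / 2,
        "worst_latency_s": interval_s,
        "fetches_per_day": apps * 86400 // interval_s,
    }

print(polling_cost(180))          # default 3-minute interval
# → {'avg_latency_s': 90.0, 'worst_latency_s': 180, 'fetches_per_day': 480}
print(polling_cost(60, apps=10))  # 1-minute interval across 10 applications
# → {'avg_latency_s': 30.0, 'worst_latency_s': 60, 'fetches_per_day': 14400}
```

Halving the interval doubles the fetch load, which is why webhooks (Solution 2) scale better than aggressive polling.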
Solution 2: Webhook
Configure a webhook in Git to notify ArgoCD:
# GitHub webhook URL
POST https://argocd.glukas.space/api/webhook
ArgoCD syncs immediately upon receiving a push.
Solution 3: Manual sync
# Via CLI
argocd app sync ollama
# Via UI
Click "Sync" on the application
Ch. 3 vs Ch. 4: When to Use Each
Both approaches are valid, and the right choice depends on team size, deploy frequency, and how much operational overhead you want to absorb upfront. The table below maps the key trade-offs to help you decide:
| | Ch. 3 (Terraform + Helm) | Ch. 4 (Terraform + ArgoCD) |
|---|---|---|
| Deploy trigger | Manual: terraform apply | Automatic: Git push |
| Latency | Immediate | 3 min (polling) |
| Reconciliation | Manual: terraform plan | Continuous: 3-min loop |
| Drift detection | Manual | Automatic |
| Self-healing | Does not exist | Configurable (selfHeal) |
| Rollback | git revert + terraform apply | ArgoCD UI (1 click) or git revert |
| Audit trail | Git + Terraform logs | Git + ArgoCD events |
| Multi-env | Duplicate code or workspaces | Branches + valueFiles |
| Permissions required | kubectl + Terraform | Git only |
| Disaster recovery | Re-run Terraform | Automatic ArgoCD re-sync |
| State management | Terraform state (central) | Git (distributed) |
| Initial complexity | Medium | High |
| Scalability (apps) | ~20 apps | Unlimited |
| Ideal team size | 1–10 | 10+ |
Chapter 3's approach is simpler to set up and sufficient for small teams with controlled deploy cadences — if a weekly terraform apply is acceptable, the added complexity of ArgoCD is not justified. Chapter 4 becomes the right choice once teams grow, deploy frequency increases, or compliance requirements demand an immutable audit trail and automated drift correction. The two are not mutually exclusive: many organisations start with Chapter 3 and migrate to Chapter 4 as their operational maturity grows.
Conclusion
Chapters 1 through 4 trace a deliberate progression — from manual kubectl commands to a fully automated, self-healing platform. Each chapter addressed a specific limitation of the one before it: verbosity, the need for manual execution, the absence of continuous reconciliation. The cumulative result is an architecture where Git is the single source of truth, and the cluster enforces that truth on its own.
The four GitOps principles are not just theoretical framing — each one translates directly into an operational guarantee. Declarative configuration means the desired state is always readable and auditable without touching the cluster. Version control means every change has an author, a rationale, and a rollback path. Pull-based deployment means no external system ever needs credentials to reach the cluster — the cluster reaches out to Git. Continuous reconciliation means drift is detected and corrected automatically, without anyone noticing or reacting.
The architecture also enforces a clean separation of concerns that makes each layer independently replaceable:
Terraform → Platform bootstrap (namespaces, secrets, ArgoCD)
Git → Application desired state
ArgoCD → Reconciliation engine
Helm → Packaging and templating
Changes to one layer do not cascade into the others. You could swap Helm for raw manifests, or replace Terraform with a different provisioner, without touching ArgoCD or the Git repository structure.
This foundation is deliberately extensible. The next steps — security, observability, multi-tenancy — build on top of it without requiring the core architecture to change.
Maturity Journey
Each chapter in this series represents a deliberate step up the maturity ladder — not just in tooling, but in ownership model, speed, and scale:
Stage 1: Manual Deployment (Ch. 1)
Maturity: Ad-hoc
Ownership: Individuals
Speed: Slow (days/weeks)
Scale: Doesn't scale
Stage 2: Infrastructure as Code (Ch. 2–3)
Maturity: Repeatable
Ownership: Ops team
Speed: Medium (hours/days)
Scale: Limited (manual execution)
Stage 3: GitOps Foundation (Ch. 4) ← We are here
Maturity: Automated
Ownership: Shared (platform + dev)
Speed: Fast (minutes/hours)
Scale: Good (self-service enabled)
Stage 4: Infrastructure as Product (Next Steps)
Maturity: Product-driven
Ownership: Platform teams (product owners)
Speed: Very fast (minutes)
Scale: Excellent (true self-service)
Metrics: DORA, satisfaction, adoption
What Comes Next
Stage 3 is a foundation, not a destination. The architecture built in this chapter is intentionally minimal — one team, two applications, one cluster — and that is the right place to start. But the same GitOps primitives that make this setup work at small scale are exactly what allow it to grow.
The diagram below shows the current state: a single developer workflow, a flat namespace structure, and ArgoCD managing two specific workloads with no shared services, no multi-tenancy, and no separation between platform concerns and application concerns.
The target looks substantially different. The cluster is split into two distinct layers: a Platform Layer of shared services — security, observability, secrets management — owned by a dedicated platform team with SLAs and roadmaps; and a Workload Layer where individual product teams deploy independently via git push, without ever touching the platform layer beneath them.
The gap between the two diagrams is not a rewrite — it is an incremental build. Every component in the Platform Layer gets added as an ArgoCD-managed application in its own namespace, following the exact same wrapper-chart pattern introduced in this chapter. The core architecture does not change; it simply gains more managed services over time.
The next chapters will build out this platform layer starting with the highest-impact additions: security, observability, and secrets management.
| Initiative | Domain | Phase | Complexity | Impact | Dependencies | Time | Priority |
|---|---|---|---|---|---|---|---|
| Pomerium | SECURITY | Foundation | Intermediate | High | ArgoCD | 3-5d | P0 |
| Sealed Secrets | SECURITY | Foundation | Basic | High | None | 1d | P0 |
| Authentik | SECURITY | Foundation | Intermediate | High | PostgreSQL | 3-5d | P0 |
| Prometheus + Grafana | OBSERVABILITY | Foundation | Intermediate | High | None | 3-5d | P0 |
| MCP Servers | INTEGRATION | Foundation | Intermediate | High | None | 2-3d | P0 |
| RAG (Qdrant) | AI/LLM | Foundation | Advanced | High | None | 1w | P1 |
| LangSmith/Langfuse | AI/LLM | Scale | Advanced | High | Prometheus | 5-7d | P1 |
| Autoscaling | INFRA | Scale | Intermediate | High | Prometheus | 2-3d | P1 |
| Loki | OBSERVABILITY | Scale | Basic | Medium | Grafana | 1-2d | P1 |
| SearXNG | INTEGRATION | Scale | Basic | Medium | None | 1d | P1 |
| Web Scraper | INTEGRATION | Scale | Intermediate | Medium | None | 2d | P1 |
| Tilt | DEVEX | Scale | Basic | Medium | None | 1d | P1 |
| Jaeger | OBSERVABILITY | Production Excellence | Advanced | Medium | Prometheus | 3-5d | P2 |
| Model Registry | AI/LLM | Production Excellence | Intermediate | Medium | None | 3-5d | P2 |
| Multi-region | NETWORK | Production Excellence | Expert | Medium | ArgoCD | 2w+ | P3 |
| Fine-tuning | AI/LLM | Production Excellence | Expert | Low | Registry | 2w | P3 |
Platform Products (Shared Services):
Pomerium + Authentik:
product: "Authentication & Authorization Platform"
customers: "All applications"
value: "SSO, MFA, zero-trust"
sla: "99.9% uptime, <200ms auth latency"
roadmap: ["RBAC granular", "SAML support", "API keys"]
Prometheus + Grafana + Loki:
product: "Observability Platform"
customers: "All teams (dev + ops)"
value: "Unified metrics/logs/traces"
sla: "30d retention, <5s query time"
roadmap: ["AIOps", "Cost attribution", "SLO management"]
Sealed Secrets:
product: "Secrets Management Platform"
customers: "All teams"
value: "Git-native secrets, rotation, audit"
sla: "Zero exposure, <1min sync"
roadmap: ["Vault integration", "RBAC", "Expiration"]
Workload-Specific Products:
RAG (Qdrant):
product: "Vector Search Service"
customers: "AI/ML teams"
value: "Semantic search, embeddings"
sla: "<100ms p95 search latency"
roadmap: ["Multi-model", "Hybrid search"]
MCP Servers:
product: "Tool Integration Platform"
customers: "LLM applications"
value: "Connect LLMs to tools"
sla: "<50ms tool invocation"
roadmap: ["Custom tools", "Async execution"]
Developer Experience Products:
Tilt:
domain: "[DEVEX]"
phase: "Scale"
complexity: "Basic"
impact: "Medium"
dependencies: ["None"]
time: "1 day"
priority: "P1"
product: "Local Development Platform"
customers: "All developers"
value: "Hot-reload, real K8s environment, fast iteration"
sla: "<5s code sync, <10s service restart"
roadmap: ["Remote development", "Debugging tools", "Resource snapshots"]
Infrastructure Products:
Multi-region:
domain: "[NETWORK]"
phase: "Production Excellence"
complexity: "Expert"
impact: "Medium"
dependencies: ["ArgoCD"]
time: "2+ weeks"
priority: "P3"
product: "Global Load Balancing & Geo-distribution"
customers: "All production workloads"
value: "Low latency worldwide, compliance (data residency)"
sla: "99.99% global availability, <100ms cross-region failover"
roadmap: ["Active-active", "Traffic shaping", "Cost optimization", "DR automation"]
Production Recommended Extensions
- TLS/HTTPS:
# argocd-values.yaml
server:
ingress:
tls:
- secretName: argocd-tls
hosts:
- argocd.empresa.com
- SSO/OIDC:
server:
config:
url: https://argocd.empresa.com
oidc.config: |
name: Okta
issuer: https://empresa.okta.com
clientID: $oidc.okta.clientId
clientSecret: $oidc.okta.clientSecret
- RBAC:
server:
rbacConfig:
policy.csv: |
p, role:developers, applications, get, */*, allow
p, role:developers, applications, sync, */*, allow
g, developers-group, role:developers
- Notifications:
notifications:
enabled: true
notifiers:
service.slack: |
token: $slack-token
templates:
template.app-deployed: |
message: Application {{.app.metadata.name}} deployed
triggers:
trigger.on-deployed: |
- when: app.status.operationState.phase in ['Succeeded']
send: [app-deployed]
- Application Sets:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: cluster-apps
spec:
generators:
- git:
repoURL: https://github.com/empresa/k8s-apps.git
revision: HEAD
directories:
- path: apps/*
template:
metadata:
name: '{{path.basename}}'
spec:
source:
repoURL: https://github.com/empresa/k8s-apps.git
path: '{{path}}'
destination:
server: https://kubernetes.default.svc
namespace: '{{path.basename}}'
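The git directory generator's behaviour can be sketched: for every directory matched by `apps/*`, it stamps the Application template with the path and its basename. An illustrative rendering (not the real ApplicationSet controller):

```python
def render_applications(directories: list[str]) -> list[dict]:
    """Expand an ApplicationSet-style template for each matched app directory."""
    apps = []
    for path in directories:
        basename = path.rstrip("/").split("/")[-1]  # {{path.basename}}
        apps.append({
            "name": basename,       # Application name
            "source_path": path,    # {{path}}
            "namespace": basename,  # destination namespace
        })
    return apps

print(render_applications(["apps/ollama", "apps/librechat"]))
# → [{'name': 'ollama', 'source_path': 'apps/ollama', 'namespace': 'ollama'},
#    {'name': 'librechat', 'source_path': 'apps/librechat', 'namespace': 'librechat'}]
```

Adding a new app then means adding a directory under `apps/` and pushing; no new Application resource is written by hand.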
Monitoring:
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: argocd-metrics
spec:
selector:
matchLabels:
app.kubernetes.io/name: argocd-metrics
endpoints:
- port: metrics
Technical Resources
Auxiliary Tools:
- argocd CLI
- kubectl-argo-rollouts (progressive delivery)
- argocd-notifications (alerts)
- argocd-image-updater (auto-update images)
End of Chapter 4