## Executive Summary
TL;DR: This blog post solves the problem of manual, inconsistent, and fragile homelab setups by detailing an automated, resilient system. It integrates Talos Linux, Proxmox, and a GitOps approach using ArgoCD and Argo Workflows for infrastructure provisioning, application management, and strategic disaster recovery.
## Key Takeaways
- Proxmox VE combined with Talos Linux forms a robust, API-driven infrastructure base for automated VM provisioning and a secure, immutable Kubernetes operating system.
- ArgoCD implements a GitOps strategy, ensuring continuous synchronization of Kubernetes cluster configurations and applications from a Git repository, preventing configuration drift and enabling automated deployments.
- Argo Workflows orchestrates complex operational tasks like automated backups (Proxmox VMs via PBS, Kubernetes apps via Velero) and disaster recovery testing, significantly enhancing homelab resilience and recovery capabilities.
Building a robust, automated homelab or small-scale IT environment presents unique challenges. This post details how integrating Talos Linux, Proxmox, and a GitOps approach with ArgoCD, Argo Workflows, and strategic Disaster Recovery (DR) can transform a manual, fragile setup into a resilient, self-healing system.
## Symptoms: The Homelab Headache
Many IT professionals building or maintaining homelabs face a common set of frustrations that hinder scalability, reliability, and efficient management. These symptoms often stem from a lack of automation and a reactive approach to infrastructure.
### Manual VM and Kubernetes Provisioning
Provisioning new virtual machines on hypervisors like Proxmox, then manually installing an OS, configuring networking, and bootstrapping a Kubernetes cluster, is incredibly time-consuming and prone to human error. Each node becomes a snowflake, making consistency impossible.
### Configuration Drift and Inconsistency
Environments quickly diverge from their intended state. Manual changes to VMs, Kubernetes manifests, or network configurations lead to inconsistencies across nodes, making troubleshooting difficult and deployments unreliable. The desired state is rarely codified and enforced.
### Lack of Automated Deployments and Updates
Deploying new applications, updating services, or even patching the underlying operating system often involves manual SSH sessions, script execution, or dashboard clicks. This process is slow, inefficient, and often leads to downtime or unexpected failures.
### Fragile Disaster Recovery (DR) Strategy
Without a clear, automated DR plan, a single hardware failure or misconfiguration can lead to significant data loss or extended service outages. Manual backups are often outdated, and the recovery process is untested, complex, and time-consuming.
### Operational Burden of Kubernetes
While powerful, managing Kubernetes itself adds overhead. Keeping the control plane healthy, nodes updated, and applications resilient requires constant vigilance. Without automation, the operational complexity can quickly overwhelm a homelab enthusiast.
## Solution 1: Proxmox + Talos for a Robust & Minimalist Infrastructure Base
The foundation of a reliable homelab begins with a solid, automated infrastructure layer. This solution combines Proxmox VE for virtualization with Talos Linux for a secure, minimal, and immutable Kubernetes operating system.
### Proxmox VE: The Virtualization Workhorse
Proxmox VE provides a powerful, open-source platform for managing virtual machines, containers, and storage. Its API-driven nature makes it an ideal candidate for infrastructure automation, allowing you to programmatically provision VMs rather than relying on manual GUI clicks.
#### Example: Automating VM Provisioning (Conceptual)
While full Terraform configurations are extensive, the principle is to use Proxmox's API or tools like `qm` to create VMs from templates. Imagine a script that defines your Kubernetes nodes:
```shell
# Example: basic VM creation using qm (simplified for illustration).
# This would typically be wrapped in a script or Terraform module
# with dynamic parameters.
VMID="101"
VMNAME="talos-node-01"
MEM="4096"          # RAM in MB (4 GB)
CPUS="2"
DISK_SIZE="32"      # Disk size in GiB
OS_TYPE="l26"       # Linux 2.6+ kernel
NET_BRIDGE="vmbr0"  # Network bridge

# Create the VM
qm create $VMID --name $VMNAME --memory $MEM --cores $CPUS --ostype $OS_TYPE

# Allocate a disk on a pre-existing storage pool (here 'local-lvm')
qm set $VMID --scsihw virtio-scsi-pci --scsi0 local-lvm:$DISK_SIZE

# Add a network device
qm set $VMID --net0 virtio,bridge=$NET_BRIDGE

# Attach a Cloud-Init drive for initial configuration (crucial for automation);
# its user data would carry the Talos machine configuration
qm set $VMID --ide2 local:cloudinit

# Configure boot order: Cloud-Init drive first, then disk
qm set $VMID --boot order="ide2;scsi0"

# Start the VM (this is where Cloud-Init kicks in and installs Talos)
qm start $VMID
```
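To provision a whole cluster rather than a single node, the same `qm` pattern can be looped. A minimal POSIX-sh sketch; the base VM ID and the `talos-node-%02d` naming scheme are assumptions, not Proxmox conventions:

```shell
#!/bin/sh
# Hypothetical helper: derive VM IDs and names for a three-node cluster.
# The actual qm calls (commented out) are assumed to run on the Proxmox host.
BASE_ID=100
for i in 1 2 3; do
  VMID=$((BASE_ID + i))
  VMNAME=$(printf 'talos-node-%02d' "$i")
  echo "provisioning VM $VMID ($VMNAME)"
  # qm create "$VMID" --name "$VMNAME" --memory 4096 --cores 2 --ostype l26
  # ...remaining qm set / qm start calls as in the single-node example...
done
```

In practice the loop body would live in a Terraform module or Ansible role, so node counts and sizing stay declarative.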
### Talos Linux: Kubernetes-Native OS
Talos Linux is a secure, minimal, and immutable operating system designed specifically for running Kubernetes. It eliminates unnecessary components, reducing the attack surface and operational overhead. Its API-driven management model aligns perfectly with a GitOps approach.
- **Minimal Footprint:** No shell, no package manager, no unnecessary services.
- **Immutability:** The OS never drifts; all changes are applied via atomic updates.
- **API-Driven:** All configuration and operations are performed via a gRPC API, making it ideal for automation.
- **Enhanced Security:** Reduced attack surface and cryptographic integrity checks.
#### Example: Generating Talos Configuration
After provisioning your VMs, you generate a Talos configuration that bootstraps your Kubernetes cluster. This configuration defines your control plane and worker nodes and essential Kubernetes settings. You then apply it to your VMs (e.g., via Cloud-Init or directly with `talosctl`).
```shell
# Generate an initial set of Talos machine configurations.
# The first argument is the cluster name; the second is the Kubernetes
# API endpoint (typically a VIP or the first control-plane node).
# --with-kubespan enables encrypted inter-node communication.
talosctl gen config my-talos-cluster https://192.168.1.10:6443 \
  --output ./cluster-configs \
  --with-kubespan

# This produces controlplane.yaml, worker.yaml, and talosconfig.
# Per-node settings (static IPs, hostnames) are layered on via config patches.
# You might base64 encode the machine config for Cloud-Init user data.

# Apply the configuration to nodes booted into maintenance mode (e.g., from
# the Talos ISO). --insecure is required before a node has its own certificates.
# On a control-plane node:
talosctl apply-config --insecure --nodes 192.168.1.10 --file ./cluster-configs/controlplane.yaml
# On a worker node:
talosctl apply-config --insecure --nodes 192.168.1.13 --file ./cluster-configs/worker.yaml

# Finally, bootstrap etcd on exactly one control-plane node, once:
talosctl --talosconfig ./cluster-configs/talosconfig bootstrap \
  --nodes 192.168.1.10 --endpoints 192.168.1.10
```
## Solution 2: GitOps with ArgoCD for Automated Configuration Management
Once your infrastructure is provisioned, GitOps takes over to manage the desired state of your Kubernetes cluster and applications. ArgoCD serves as the engine, continuously synchronizing your cluster with configurations stored in a Git repository, ensuring consistency and preventing configuration drift.
### GitOps Principles
- **Declarative:** The desired state of your infrastructure and applications is declared in Git (e.g., YAML manifests).
- **Version Controlled:** All changes are committed to Git, providing an auditable history and easy rollbacks.
- **Automated:** Changes in Git automatically trigger updates in the cluster.
- **Reconciled:** A controller continuously observes the cluster's actual state and reconciles it with the desired state in Git.
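The reconcile principle can be illustrated with a toy loop. This is a didactic sketch of the control-loop idea, not how ArgoCD is actually implemented:

```shell
desired="v2"   # what Git says should be running
actual="v1"    # what the cluster is actually running

# Observe, compare, converge: the essence of reconciliation.
while [ "$actual" != "$desired" ]; do
  echo "drift detected: actual=$actual, desired=$desired -> applying desired state"
  actual="$desired"
done
echo "in sync at $actual"
```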
### ArgoCD: The GitOps Controller
ArgoCD is a declarative, GitOps continuous delivery tool for Kubernetes. It automatically synchronizes the state of applications from a Git repository to a Kubernetes cluster.
Key Features:
- **Automated Sync:** Keeps your cluster's applications in sync with your Git repo.
- **Rollback/Roll-forward:** Easy to revert to previous states or deploy new versions.
- **Health Monitoring:** Provides visibility into the health of your deployed applications.
- **Multi-cluster Support:** Manage applications across multiple Kubernetes clusters.
#### Example: Deploying an Application with ArgoCD
First, you install ArgoCD into your Talos Kubernetes cluster (e.g., via a Helm chart or direct manifest application). Then, you define an ArgoCD Application resource in your Git repository. This resource points ArgoCD to where your application's Kubernetes manifests are stored.
Example Git repository structure:

```text
my-homelab-gitops/
├── infrastructure/
│   └── talos/
│       └── cluster-config-patches/
├── applications/
│   ├── nginx-hello-world/
│   │   ├── deployment.yaml
│   │   └── service.yaml
│   └── argocd/
│       └── application-nginx.yaml
└── argocd-apps/
    ├── homelab-infra.yaml
    └── homelab-apps.yaml
```
```yaml
# applications/argocd/application-nginx.yaml
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nginx-hello-world
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/my-homelab-gitops.git  # Your Git repository
    targetRevision: HEAD                  # Or a specific branch/tag like 'main'
    path: applications/nginx-hello-world  # Path within the repo to the manifests
  destination:
    server: https://kubernetes.default.svc  # The target cluster
    namespace: default                      # The target namespace for the application
  syncPolicy:
    automated:
      prune: true     # Delete resources that are no longer in Git
      selfHeal: true  # Revert any manual changes to match the Git state
    syncOptions:
      - CreateNamespace=true  # Automatically create the namespace if it doesn't exist
```
Once this Application manifest is committed to your Git repository and ArgoCD is configured to sync from that repository, ArgoCD will automatically deploy and manage the `nginx-hello-world` application in your cluster. Any changes to `deployment.yaml` or `service.yaml` in Git will be automatically applied by ArgoCD.
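The `argocd-apps/` directory in the tree above hints at the app-of-apps pattern: a parent Application that deploys other Application manifests. A hypothetical `argocd-apps/homelab-apps.yaml` might look like this, with the repo URL and paths mirroring the example structure:

```yaml
# argocd-apps/homelab-apps.yaml (hypothetical app-of-apps parent)
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: homelab-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/my-homelab-gitops.git
    targetRevision: HEAD
    path: applications/argocd    # Directory holding the child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd            # Child Application resources live in the argocd namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

With this in place, adding a new app is just committing another child Application manifest under `applications/argocd/`.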
## Solution 3: Argo Workflows & Integrated DR for Operational Automation & Resilience
Beyond declarative application deployments, operational tasks like automated backups, DR testing, and complex multi-step processes still require orchestration. Argo Workflows, combined with a robust DR strategy, ensures your homelab is not just automated but also resilient.
### Argo Workflows: The Workflow Engine
Argo Workflows is an open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. It's ideal for tasks that require sequential steps, conditional logic, or parallel execution, such as CI/CD pipelines, data processing, or, crucially, operational automation and DR.
Use Cases in a Homelab:
- **Automated Backups:** Triggering Proxmox VM backups, Kubernetes application backups (Velero).
- **DR Testing:** Periodically spinning up a test environment, restoring backups, and validating service functionality.
- **Infrastructure Provisioning:** Orchestrating the creation of new Talos nodes on Proxmox.
- **Application Release Pipelines:** Orchestrating complex deployments that involve pre-hooks, post-hooks, and external integrations.
#### Example: Conceptual Backup Workflow
This workflow outlines a conceptual plan to back up both Proxmox VMs and Kubernetes applications.
```yaml
# Example: Argo Workflow for a homelab backup strategy.
# This is a conceptual workflow: the referenced client tools
# (proxmox-backup-client, the velero CLI) must exist in your container images,
# and credentials/API tokens would be mounted as Secrets.
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: homelab-backup-
spec:
  entrypoint: backup-strategy
  templates:
    - name: backup-strategy
      dag:
        tasks:
          - name: backup-proxmox-vms
            template: backup-proxmox
          - name: backup-kubernetes-apps
            template: backup-velero
            dependencies:        # Runs after the Proxmox backup completes;
              - backup-proxmox-vms  # remove this to run both in parallel
    - name: backup-proxmox
      container:
        image: your-custom-backup-image:latest  # Image with a Proxmox API or PBS client
        command: ["/bin/sh", "-c"]
        args:
          - |
            echo "Starting Proxmox VM backups..."
            # Illustrative invocations; real PVE VM backups are typically
            # triggered via vzdump on the host or the Proxmox API.
            proxmox-backup-client backup --vm 101 --repository my-pbs-repo
            proxmox-backup-client backup --vm 102 --repository my-pbs-repo
            echo "Proxmox VM backups complete."
    - name: backup-velero
      container:
        image: velero/velero:latest  # Official Velero image
        command: ["/bin/sh", "-c"]
        args:
          - |
            echo "Starting Kubernetes application backups with Velero..."
            # Assumes Velero is already installed in the cluster and
            # configured with a backup storage location.
            velero backup create k8s-apps-$(date +%Y%m%d%H%M%S) --include-namespaces '*' --default-volumes-to-restic
            echo "Kubernetes application backups complete."
```
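To run this backup on a schedule rather than on demand, the Workflow can be saved as a WorkflowTemplate and wrapped in a CronWorkflow. A minimal sketch; the `homelab-backup` template name is an assumption:

```yaml
# Hypothetical nightly schedule for the backup workflow above.
---
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: homelab-backup-nightly
spec:
  schedule: "0 3 * * *"        # Every night at 03:00
  concurrencyPolicy: Forbid    # Skip a run if the previous one is still going
  workflowSpec:
    workflowTemplateRef:
      name: homelab-backup     # Assumes the templates above were saved as a WorkflowTemplate
```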
### Integrated Disaster Recovery (DR)
A true DR strategy for a GitOps-driven homelab integrates several components:
- **Infrastructure as Code:** Your entire Proxmox + Talos setup is defined in Git. In a disaster, you can rebuild your hypervisor and then redeploy Talos nodes from scratch.
- **ArgoCD for Applications:** ArgoCD ensures all your Kubernetes applications can be quickly restored by syncing their desired state from Git to a new or recovered cluster.
- **Proxmox Backup Server (PBS):** For hypervisor-level VM backups. Critical for stateful applications running directly on VMs or for restoring the base OS of Talos nodes if not using full infrastructure as code for OS deployment.
- **Velero:** For Kubernetes-native application backups, including persistent volumes and Kubernetes resource manifests.
- **Argo Workflows for Orchestration:** Automating the recovery process, from provisioning VMs to restoring backups and verifying service health.
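During an orchestrated restore, a workflow step has to decide which Velero backup to restore. Because the backup names above embed a sortable timestamp, the newest one can be picked with a plain sort. A minimal sketch; the backup names are illustrative:

```shell
# Pick the newest timestamped backup name from a list,
# e.g. as listed by the velero CLI in a restore step.
latest_backup() {
  sort | tail -n 1   # Names embed YYYYmmddHHMMSS, so lexical sort is chronological
}

LATEST=$(printf 'k8s-apps-20240101030000\nk8s-apps-20240301030000\nk8s-apps-20240201030000\n' | latest_backup)
echo "restoring from $LATEST"
# velero restore create --from-backup "$LATEST"   # Run against the recovered cluster
```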
| Feature | Manual DR Strategy | Automated GitOps DR Strategy |
|---|---|---|
| **Recovery Time Objective (RTO)** | High (hours to days) | Low (minutes to hours) |
| **Recovery Point Objective (RPO)** | Can be high (depends on last manual backup) | Low (frequent automated backups) |
| **Consistency** | Highly variable, prone to human error | High, enforced by Git and automated processes |
| **Testing** | Infrequent, complex, disruptive | Frequent, automated, non-disruptive (sandbox recovery) |
| **Infrastructure Recovery** | Manual re-creation of VMs, OS installation | Automated provisioning from IaC (e.g., Terraform/Ansible for Proxmox, Talos configs) |
| **Application Recovery** | Manual re-deployment, configuration, data restore | ArgoCD auto-sync, Velero restore from object storage |
| **Complexity** | High for large environments, difficult to scale | High initial setup, low ongoing maintenance and recovery |
| **Cost (Operational)** | High in labor, potential for extended downtime | Lower in labor, quicker recovery, reduced business impact |
## Conclusion
By adopting a comprehensive strategy leveraging Proxmox for virtualization, Talos Linux for a minimalist Kubernetes OS, and a robust GitOps workflow driven by ArgoCD and Argo Workflows for automation and DR, you can transform your homelab. This approach minimizes manual intervention, ensures consistency, enhances security, and provides a clear, automated path to recovery from disaster. The initial investment in setting up these systems pays dividends in stability, scalability, and peace of mind, allowing you to focus on experimentation and innovation rather than constant firefighting.
