andygolubev

Posted on Jun 28

Building an Offline AI Platform with K3s, Ansible, Argo CD, vLLM, and NVIDIA GPU

#kubernetes #ansible #ai #devops

Introduction

Running Kubernetes in the cloud is straightforward when every node can reach package repositories, container registries, GitHub, and model hubs. The task becomes much more interesting when the target server has no internet connection at all.

For this proof of concept, I wanted to build a self-contained AI platform on a single Ubuntu server. The final environment had to include:

K3s
NVIDIA GPU support
vLLM with a locally stored Qwen model
Argo CD and a local GitOps repository
A FastAPI and LangChain chatbot
Prometheus, Grafana, Loki, Tempo, and OpenTelemetry
k9s for local operations

The target machine is an Ubuntu 26.04 AMD64 server with an NVIDIA A10G GPU. I used an EC2 g5.2xlarge instance for validation, but AWS is only a test harness. The installer itself does not provision infrastructure and does not depend on AWS.

The important requirement was simple: after copying the bundle to the server, the installation must not make any external network request.

The two-environment design

The solution separates preparation from installation. A connected machine downloads and packages everything. The isolated machine only consumes local files.

This boundary is the main architectural decision. Instead of teaching every component how to tolerate a disconnected network, I move all network-dependent work to a controlled preparation phase.

Preparing the payload

The repository provides one aggregate command:

cd offline-bundle
./scripts/download-all-artifacts.sh

The script requires Docker and at least 50 GB of free space. On macOS and non-AMD64 systems, it uses an Ubuntu 26.04 AMD64 container so that downloaded packages match the target architecture.

Each artifact group is handled by a specialized script. Completed steps have content and environment fingerprints, so an interrupted download can resume without rebuilding everything. A --clean option is available when a completely fresh payload is required.

The generated payload/ directory contains binaries, .deb packages, OCI image archives, Kubernetes manifests, tools, and model weights. It is generated locally and intentionally ignored by Git because it is large and reproducible.

The model and vLLM image are the largest parts of the bundle. In this setup, the vLLM image is around 8 GB compressed and the Qwen2.5 7B model requires roughly 14–15 GB. Disk planning is not optional; the target installer checks for at least 80 GB of free space before it starts.

One command on the isolated server

After preparing the payload, I copy the complete offline-bundle/ directory to the target and run:

cd offline-bundle
./install.sh

The installer elevates with sudo, verifies Ubuntu version and CPU architecture, checks free space, and validates the SHA256 checksum of every artifact. It then installs Ansible from local .deb files and runs a localhost playbook.

The playbook uses ansible_connection=local; SSH is not part of the installation design. The roles run in a deliberate order:

Install K3s from its binary and air-gap image archive.
Install the NVIDIA driver and container toolkit, then expose the GPU to Kubernetes.
Start the local registry and Git mirror, then install Argo CD.
Install k9s.
Deploy the observability stack.
Load the model and start vLLM.

The installer and roles are idempotent. If a validation fails, I can correct the problem and run the same command again.

What runs inside the server

The result is a complete platform on one machine. Some supporting services, such as the Git daemon, run on the host. Application and platform workloads run inside K3s.

K3s imports its standard air-gap archive directly. Additional images are imported into containerd and pushed to a registry listening on localhost:5000. Workloads use local image references, and the vLLM deployment uses imagePullPolicy: Never as an extra guard against accidental pulls.

The Qwen2.5-7B-Instruct snapshot is copied to /opt/models and mounted into the vLLM pod. vLLM exposes an OpenAI-compatible endpoint on port 8000 and gets the single nvidia.com/gpu resource advertised by the NVIDIA device plugin.

GitOps without GitHub

Argo CD usually pulls desired state from an external Git provider. That is impossible in an isolated network, so the bundle creates bare repositories on the target and serves them with a read-only Git daemon.

An in-cluster Service and Endpoints object exposes the host daemon as git://git-mirror.gitops.svc.cluster.local. Argo CD reads an app-of-apps repository from this address, discovers the agent application, and deploys its Helm chart.

This keeps the GitOps reconciliation model even when the platform cannot reach GitHub. The offline bundle is the delivery mechanism; the local Git mirror becomes the runtime source of truth.

The local AI application

To verify that the stack works as a platform rather than a collection of pods, I included a small chatbot. It uses FastAPI, LangChain, and ChatOpenAI, but points the client to the internal vLLM service instead of a public API.

The application supports a system prompt and conversation history. It also adds an OpenTelemetry span around every model invocation. Optional Langfuse integration can be enabled when a reachable Langfuse instance and credentials are provided.

I also tested the OpenAI-compatible endpoint with OpenCode. Pointing existing OpenAI clients at the local service is one of the useful properties of vLLM: applications do not need a custom inference protocol.

Observing the model and the GPU

An offline platform still needs normal operational visibility. The bundle includes a deliberately compact but complete stack:

Prometheus for metrics
Grafana for dashboards
Loki and Promtail for logs
Tempo for traces
OpenTelemetry Collector for OTLP ingestion
kube-state-metrics and node-exporter for Kubernetes and host metrics
NVIDIA DCGM exporter for GPU metrics

Prometheus scrapes vLLM metrics such as running requests as well as GPU utilization from DCGM exporter. Grafana provisions Prometheus, Loki, and Tempo as datasources and loads a bundled vLLM/GPU dashboard automatically.

The following Mermaid diagram shows how one chat request becomes all three telemetry signals:

Validation matters more in an air gap

In a connected environment, a missing image or package may be downloaded later. In an isolated environment, a missing transitive dependency can stop the entire installation. For that reason, validation is part of the design rather than a final checklist.

Before transfer, the preparation scripts verify the payload structure and checksums. During installation, the Ansible roles validate each layer before continuing. The acceptance checks include:

The K3s node reaches Ready.
The host reports the NVIDIA A10G with nvidia-smi.
Kubernetes reports nvidia.com/gpu: 1 as allocatable.
The vLLM pod starts without an external image pull.
The model loads from /opt/models/Qwen2.5-7B-Instruct.
The /v1/models and chat completion APIs respond.
Argo CD applications synchronize from the local Git mirror.
Prometheus targets are reachable.
Grafana contains the provisioned datasources and dashboard.
Loki returns logs and Tempo returns the chatbot trace.

The repository includes the exact commands in offline-bundle/VALIDATION.md, making the test procedure reproducible instead of relying on visual inspection alone.

Lessons learned

The most difficult part of an offline Kubernetes installation is not K3s itself. It is the complete dependency graph around it: OS packages, container images, GPU kernel modules, tools, model files, manifests, and the runtime services that normally assume internet access.

A few choices made the setup manageable:

Use one explicit network boundary. All downloads happen on the preparation host; installation uses local files only.
Pin and record artifacts. Image manifests and version files make the bundle understandable and reproducible.
Verify before transfer. Checksums catch incomplete copies and corrupted large files before Ansible makes changes.
Keep the runtime local. The registry, Git mirror, model storage, inference API, and telemetry backends all live on the target.
Validate each layer. A running pod is not sufficient proof that the GPU, model API, GitOps reconciliation, and observability pipeline work together.

This is a proof of concept and intentionally uses a single node. A production design would need decisions about high availability, storage redundancy, backup, security hardening, model lifecycle, and how signed bundle updates cross the air gap. Still, the project demonstrates that the same cloud-native workflows can operate in an isolated environment when artifact preparation is treated as a first-class part of the architecture.

Conclusion

With a prepared payload and a local Ansible playbook, I can turn an isolated Ubuntu GPU server into a small AI platform using one installation command. K3s provides the runtime, Argo CD preserves the GitOps workflow, vLLM serves a local model, and the observability stack makes the result operable.

The code is available in my GitHub repository: github.com/andygolubev/ansible-k3s-on-prem

Feel free to connect with me on LinkedIn.

I hope you enjoyed this article.

Top comments (3)

Max Quimby • Jun 29

The connected-prep / isolated-consume split is the right architectural boundary, and it's the part most "offline" guides hand-wave. Two things that bit us doing similar air-gapped vLLM deploys. First, the model weights and the vLLM image drift independently — pin the image digest but pull a "latest" Qwen revision (or the reverse) and you get a bundle that installs clean but serves a different tokenizer than you tested against, so fingerprint both in the same manifest, not just the artifacts you happened to think about. Second, the NVIDIA stack is the silent offline killer: the device plugin and container toolkit assume they can reach repos for the matching userspace libs, and getting driver + container-toolkit + the vLLM CUDA build into one compatibility window inside the payload took us more iterations than the rest of the platform combined. Did you pin the device-plugin and toolkit versions into the payload too, or lean on a pre-baked driver on the target image? And for model updates — full re-bundle, or did you get a weights-only delta path working?

VoltageGPU • Jun 29

Great article! I appreciate the practical approach to deploying AI workloads on an offline setup. Have you considered using containerd with gRPC for better GPU resource isolation? We ran into similar constraints when setting up VoltageGPU for confidential inference, and it helped with resource contention on multi-tenant nodes.

Raju Dandigam • Jun 30

This is a strong platform-engineering walkthrough because offline AI changes almost every assumption teams make in cloud-first deployments. I like the separation between preparation and installation; moving all network-dependent work into a controlled packaging phase is the right boundary. The inclusion of Loki, Tempo, Prometheus, Grafana, and OpenTelemetry is also important because offline does not mean unobservable. For production environments, I’d be interested in how you handle model/version provenance and rollback when the platform is fully air-gapped.