Carlos Pereyra

Posted on Jun 2 • Originally published at pereyra.ar

I call my homelab on the phone and an AI agent picks up and runs my infra

#devops #kubernetes #ai #selfhosted

Originally posted (in Spanish) on my blog: pereyra.ar/blog/clau-tg

TL;DR

For months I'd been building scattered labs in my homelab: a multimodal RAG, a WhatsApp router, a GPU switchboard, Claude Code running on K3s, a realtime lip-sync avatar. Each one solved a single problem and lived on its own. This week I plugged them all in behind a single voice agent, clau, and I talk to it in the most literal way: with a Telegram phone call.

The interesting part wasn't the voice agent itself (there are a thousand of those). It was that, once all the labs were connected, clau stopped being "a voice bot" and turned into the primary interface for operating my infra. I talk to it, it sees my screen, it shows me its own, it runs kubectl against my cluster, it browses the web, and when something is out of its reach it delegates to another agent with more permissions and comes back with the answer, spoken.

This post covers the architecture, the "voice agent as an orchestration layer" pattern, and the honest tradeoffs of building this on-prem.

The problem: labs that didn't talk to each other

Over the last year I kept adding pieces to the homelab, each with its own purpose:

Lab	What it does	Stack
Multimodal RAG	Searches my docs/images	FastAPI + Qdrant + Gemini
WhatsApp Router	Sends/receives WA messages	Evolution API + ws-router
GPU Switchboard	Swaps LLM profiles on the 3090	custom daemon on the AI server
claupod	Claude Code for long reasoning	K3s on Proxmox
Realtime avatar	Talking head with lip-sync	MuseTalk + TensorRT

The classic problem: each one had its own way of being invoked. To do anything "end to end" I'd end up with three terminals open, a WhatsApp chat, and a dashboard. The friction wasn't in the capabilities — they were all there — it was in the glue between them.

The idea: the voice agent as an orchestration layer

clau lives on my workstation, picks up a real Telegram call when I call it (or calls me proactively), and speaks with its own voice via Gemini Live. So far, just another voice agent.

The phase change was connecting the labs I already had, one by one. Each lab is exposed as a native tool on the agent. Today there are around 50 native tools, plus the ability to delegate the heavy stuff to Claude Code. All on-prem, with my credentials, my RAG, my skills.
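To make "each lab is exposed as a native tool" concrete, here is a minimal sketch of a tool registry. It is not the actual clau code: the `tool` decorator, the `TOOLS` dict, `dispatch`, and both stub tools are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    handler: Callable[..., str]


# Hypothetical in-process registry; the real agent's tool wiring is not shown in the post.
TOOLS: Dict[str, Tool] = {}


def tool(name: str, description: str):
    """Register a function as a tool the voice model can call by name."""
    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = Tool(name, description, fn)
        return fn
    return decorator


@tool("rag_search", "Search my documents and images")
def rag_search(query: str) -> str:
    # Placeholder body: a real handler would call the RAG lab's HTTP API.
    return f"(stub) results for {query!r}"


@tool("gpu_profile", "Swap the LLM profile loaded on the GPU")
def gpu_profile(profile: str) -> str:
    # Placeholder body: a real handler would talk to the GPU Switchboard daemon.
    return f"(stub) switched to {profile}"


def dispatch(name: str, **kwargs) -> str:
    """Run a tool call coming from the model; unknown names get a speakable error."""
    entry = TOOLS.get(name)
    if entry is None:
        return f"Unknown tool: {name}"
    return entry.handler(**kwargs)
```

With this shape, adding a lab means adding one decorated function, and the names and descriptions in `TOOLS` can be turned into whatever tool declarations the voice model expects.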

The difference from a chatbot comes down to two capabilities I don't see that often in agents:

It sees what I'm doing. I share my screen or my camera and it reasons over that.
It shows me what it's doing. It shares its terminal or its browser inside the call's video, so we both work on the same thing.

How it works, in practice

In the video I call it and ask it to show itself. The real walkthrough:

It operates the cluster. I ask it to share the terminal and run commands:

kubectl get nodes
# k3s-master, k3s-worker-01, ubuntu  → all Ready
kubectl get ns
# agents, cloud, default, kube-system, rag-multimodal, ...
kubectl -n cloud get pods
# atlassian-mcp, evolution-api, message-fileserver Running; some Jobs Completed

It doesn't just run things: it sees the output in its own shared terminal and we talk about what shows up. It's operating the infra by conversation.
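One way a terminal tool like this could be kept on a short leash is a read-only allowlist in front of kubectl. The sketch below is an assumption, not the setup from the post: the allowed verbs, the flag list, and the function names `parse_kubectl` and `run_kubectl` are all illustrative.

```python
import shlex
import subprocess
from typing import List

# Assumption: only read-only verbs are allowed from the voice call.
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}
# Global flags that consume the next token as their value.
FLAGS_WITH_VALUE = {"-n", "--namespace", "--context", "--kubeconfig"}


def parse_kubectl(command: str) -> List[str]:
    """Split a command line and reject anything that is not a read-only kubectl call."""
    argv = shlex.split(command)
    if not argv or argv[0] != "kubectl":
        raise ValueError("only kubectl commands are allowed")
    args = argv[1:]
    i = 0
    # Skip leading flags so the first positional token is treated as the verb.
    while i < len(args):
        if args[i] in FLAGS_WITH_VALUE:
            i += 2
        elif args[i].startswith("-"):
            i += 1
        else:
            break
    if i >= len(args) or args[i] not in READ_ONLY_VERBS:
        raise ValueError("verb not in the read-only allowlist")
    return argv


def run_kubectl(command: str, timeout: float = 15.0) -> str:
    """Run a validated command without a shell and return its output as text."""
    argv = parse_kubectl(command)
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.stdout if result.returncode == 0 else result.stderr
```

Because the command runs as an argument list rather than through a shell, something like `; rm -rf /` is never interpreted. An allowlist like this still permits reading Secrets, so cluster RBAC remains the real boundary.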

It browses the web. I tell it "open my site and go to the Grafana dashboard". It shares the browser (Chromium + Playwright), navigates, tries to log in — the password goes in without the model ever seeing it — and once it's in, I ask for a screenshot of the panel and it sends it to me over WhatsApp. The WA integration isn't an extra: it's part of the same call flow.
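A common way to get "the password goes in without the model ever seeing it" is placeholder substitution at the tool layer. The sketch below assumes that approach; the post does not say how clau does it. The `{{secret:name}}` syntax, `resolve_secrets`, `fill_field`, and the `vault` dict are hypothetical.

```python
import re
from typing import Callable, Dict

# Hypothetical placeholder syntax the model may emit instead of a real secret.
PLACEHOLDER = re.compile(r"\{\{secret:([a-z0-9_]+)\}\}")


def resolve_secrets(text: str, vault: Dict[str, str]) -> str:
    """Swap placeholders for real values; raises KeyError for unknown secret names."""
    return PLACEHOLDER.sub(lambda m: vault[m.group(1)], text)


def fill_field(selector: str, value: str, vault: Dict[str, str],
               type_into: Callable[[str, str], None]) -> str:
    """Type into a page field and return only a redacted confirmation to the model."""
    type_into(selector, resolve_secrets(value, vault))
    kind = "secret" if PLACEHOLDER.search(value) else "plain"
    return f"filled {selector} ({kind} value)"
```

The model only ever writes and reads the placeholder; the real value exists between the vault lookup and the `type_into` callback. With Playwright's sync API, `page.fill` has a compatible `(selector, value)` shape and could be passed in as that callback.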

It queries the RAG, multimodally. I added a detail to the RAG that I like: alongside the classic vector search (Qdrant + Google embeddings), I store the same information in an LLM-as-a-wiki structure (the approach Andrej Karpathy talked about). A layer on top analyzes the query: if it's generic or open-ended, it walks the wiki; if it's very specific, it goes to the vectors; or it combines both, starting with a wiki pass and then running the fine-grained search. I sent it the Apple logo over WhatsApp and it returned the description it had generated. The trigger phrase is "use the RAG to…".
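The routing decision can be sketched as a small classifier. This version uses crude regex cues purely to show the three-way split; the post does not say how the real layer classifies queries (it may well use an LLM), and `Route`, `route_query`, and both patterns are illustrative assumptions.

```python
import re
from enum import Enum


class Route(Enum):
    WIKI = "wiki"
    VECTORS = "vectors"
    BOTH = "both"


# Hypothetical cues; a real router would likely be smarter than two regexes.
OPEN_ENDED = re.compile(
    r"^(what is|what are|explain|summarize|overview|tell me about)\b", re.I
)
SPECIFIC = re.compile(r"\d|\"[^\"]+\"|\b[\w-]+\.(pdf|png|jpg|md|yaml)\b", re.I)


def route_query(query: str) -> Route:
    """Wiki for open-ended questions, vectors for specific ones, both when mixed."""
    q = query.strip()
    open_ended = bool(OPEN_ENDED.search(q))
    specific = bool(SPECIFIC.search(q))
    if open_ended and specific:
        return Route.BOTH
    if specific:
        return Route.VECTORS
    if open_ended:
        return Route.WIKI
    return Route.BOTH  # unsure: wiki pass first, then the fine-grained search
```

Defaulting to `BOTH` when neither cue fires trades some latency for recall, which seems a reasonable bias for a single-user setup.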

It schedules and checks services. It creates a Google Calendar event, tells me the status of the FLUX service on the AI server, looks at the camera and describes my setup (yes, it also spotted the mate).

The pattern that helped most: delegation

clau doesn't solve everything during the call. For tasks that need long reasoning or permissions the voice agent doesn't have, it delegates to claupod, a Claude Code instance running on K3s. claupod does the work (including SSHing into configured external servers to fetch a specific piece of data), returns the result, and clau relays it to me, spoken, on the same call.

It takes a bit longer because the task is delegated, but the pattern is the good part: the voice agent is the conversational face; the agents with permissions do the dirty work. The voice layer never touches sensitive credentials; it orchestrates.
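The delegation flow can be sketched with asyncio. This assumes a worker that takes a task description and returns text; the `Worker` type, `delegate`, and `fake_worker` are hypothetical stand-ins, and the real transport between clau and claupod is not described in the post.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical worker interface: takes a task description, returns a text result.
Worker = Callable[[str], Awaitable[str]]


async def delegate(task: str, worker: Worker, timeout: float = 60.0) -> str:
    """Hand a task to a privileged worker and return a sentence the voice layer can speak."""
    try:
        result = await asyncio.wait_for(worker(task), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the worker call when the timeout expires.
        return "The delegated task timed out."
    return f"Done. {result.strip()}"


async def fake_worker(task: str) -> str:
    # Stand-in for the privileged agent; credentials would live only on that side.
    await asyncio.sleep(0.01)
    return f"finished: {task}\n"


if __name__ == "__main__":
    print(asyncio.run(delegate("check disk usage on the NAS", fake_worker)))
```

The voice layer only sees a task string going out and a result string coming back, which is what keeps credentials on the worker side.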

The avatar (and why it has a delay)

The last bit is that clau shows up with a realtime webcam avatar: MuseTalk with TensorRT generates the lip-sync on the AI server. It's honestly the most expensive piece of the whole pipeline, since it needs a lot of realtime compute, so it runs with a ~1 second offset from the audio. The video itself is realtime; the fine sync is what's left to improve. I'm leaving it as is: good enough to show the idea, not good enough for production.

Honest tradeoffs

Latency is the model, not the network. I measured the round-trip to claupod: 99% of the time is Gemini reasoning (4-8s depending on tools); the internal network is <5ms. Optimizing here is about the prompt and the tools, not the infra.
You have to tame the model. Gemini Live fires tools at random when faced with confusing questions, and sometimes hallucinates identifiers. You fix it with explicit, inviolable rules in the prompt plus fallbacks, but it's fiddly work.
Shared GPU = warmups. Everything hangs off a single 3090 with time-multiplexing, so the first call to a freshly swapped model pays the load cost. For my case (one user, me) it's perfect; for real concurrency, it isn't.
The browser doesn't handle 2FA/SSO. You can see it in the video: logging into Grafana with two-factor stops it. It's a known limit of the browser share.
Avatar with a ~1s offset. Already mentioned.
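For the hallucinated-identifier problem, one possible fallback is to check every identifier the model emits against a known list before a tool runs. This is a sketch under assumptions, not the author's fix: `resolve_identifier`, the cutoff value, and the idea of auto-correcting near misses are all illustrative.

```python
import difflib
from typing import Iterable, Optional, Tuple


def resolve_identifier(candidate: str, known: Iterable[str],
                       cutoff: float = 0.8) -> Tuple[Optional[str], str]:
    """Return (identifier, note): exact match, close match, or None with a retry hint."""
    known_list = list(known)
    if candidate in known_list:
        return candidate, "exact"
    # Standard-library fuzzy match; returns at most one suggestion above the cutoff.
    close = difflib.get_close_matches(candidate, known_list, n=1, cutoff=cutoff)
    if close:
        return close[0], f"corrected from {candidate!r}"
    options = ", ".join(sorted(known_list))
    return None, f"{candidate!r} does not exist; valid options: {options}"
```

When nothing matches, the note can be fed back to the model so it retries with a valid name instead of running a tool against something that does not exist. Auto-correcting near misses suits read-only calls; for anything destructive, asking for confirmation is the safer default.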

None of this is "revolutionary". It's well-placed glue over pieces that already worked, and for a single-user homelab it pays off a lot.

Stack

Voice: Gemini Live (Aoede voice) over a Telegram call (hydrogram + pytgcalls).
Orchestration: native tools on the agent + delegate to Claude Code (claupod) on K3s.
Infra: K3s on Proxmox, NAS for PVs, a single RTX 3090 with a GPU Switchboard.
RAG: FastAPI + Qdrant + Google embeddings + an LLM-as-a-wiki layer.
Messaging: Evolution API + ws-router (WhatsApp).
Browser: Chromium + Playwright (headless CDP screencast).
Avatar: MuseTalk + TensorRT.
Auxiliary STT/TTS: self-hosted Whisper + Piper.

Wrap-up

The takeaway: the value wasn't in building one more agent, but in giving the labs I already had a single conversational front door. A phone call ended up being the best interface to operate the homelab while I walk around the house.

There's stuff left to improve (the avatar offset, taming the model better, handling SSO in the browser), so there will be more videos.

If you're building something similar or want to discuss the on-prem approach, reach out — I'm happy to chat.

Top comments (2)

Gilder Miller • Jun 12

This is a pretty neat setup. What stood out to me is how the voice layer became the front door to the whole system rather than just another interface on top of it. The delegation pattern also feels well thought out since it keeps the operational boundaries clear while still making the experience feel seamless. Sometimes the most useful projects are just solid integration work done well.

Manuel Bruña • Jun 9

Wow, this project is amazing! Excellent work!