Dener Fernandes

Posted on May 22 • Edited on May 25

I decided to build my own alternative to Kubernetes. Here's the architecture I chose (and why).

#opensource #kubernetes #devops #infrastructure

In my last post I explained why I decided to build an alternative to Kubernetes. I got a lot of comments, some supportive, others questioning whether I really knew what I was getting into. Fair enough. But the decision was already made, so the next step was simple: stop thinking and start doing.

The name

First thing I did was pick a name. Might seem silly, but I always do this with my projects — naming something turns a vague idea into something real. It stops being "that crazy thought I had in the shower" and becomes an actual thing.

This time the name came instantly. One of the main goals of the project is to radically simplify the experience of getting things running. No stress, no 40-hour courses, no infinite YAML. Something that feels almost... magical.

So: Houdini.

Harry Houdini was one of the most famous magicians and escape artists in history. Maybe it's pretentious to compare an open-source project to the guy who escaped from coffins underwater, but the name fits perfectly with what I want to deliver: making something complex look simple.

The study phase

Let me clarify something that was confusing in the previous post: some people took it to mean that I don't know Kubernetes at all. I don't blame them; I probably wasn't clear. When I mentioned "using AI to create the cluster," I was talking about saving time, not about total ignorance. I've been using and deploying K8s in production for years.

But there's a difference between using Kubernetes and creating something that competes in the same space. For that, I needed to go much deeper.

I started studying the Kubernetes source code. Not the docs, not the tutorials — the code. How things were implemented, why certain decisions were made, what trade-offs were accepted. I did the same with HashiCorp's Nomad. This study even led me to consider getting the CKA certification — not for the certificate itself, but for the structured learning process.

AI helps a lot at this stage, I'll be honest. Reading 2 million lines of Kubernetes Go code without any guide would be insane. But the understanding has to be mine — AI accelerates, it doesn't replace.

It's not a new Kubernetes

A crucial point: I'm not trying to build a new Kubernetes. A solo developer is unlikely to pull that off, and it's not the goal anyway. I want to build an alternative. Not better, not worse, simply a different way of thinking about how an orchestrator should work.

The project is opinionated. Unapologetically. Because it reflects how I would like an orchestrator to work, based on 22 years of writing code and deploying to production.

The design: learning from others' mistakes

To arrive at Houdini's design, I did three things:

Studied the architecture of Kubernetes and Nomad in depth
Catalogued the main community criticisms of K8s, Nomad, and Swarm
Combined all of that with my practical experience of what works and what doesn't

The most common criticisms I found:

Kubernetes:

Absurd complexity for simple use cases
Dozens of components to maintain (etcd, kube-apiserver, scheduler, controller-manager, kubelet, kube-proxy...)
YAML hell — verbose configs that are painful to debug
Service mesh requires external components (Istio, Linkerd)
Learning curve measured in months

Nomad:

Simple but incomplete: traditionally needs Consul for service discovery and Vault for secrets (native service discovery and variables only arrived in Nomad 1.3 and 1.4)
No built-in service mesh
Smaller community, less tooling

Docker Swarm:

Effectively abandoned by Docker Inc., with little new development
No real autoscaling
Limited deployment strategies

Houdini's architecture

With all of that in mind, these are the design decisions I made. Note: I'm not saying these are immutable — the project is in active development and things may change as I encounter real problems.

Principle #1: One binary, zero external dependencies

$ houdini server   # control plane
$ houdini agent    # worker node
$ houdini deploy   # CLI

Same binary does everything. No separate etcd, no Consul, no Vault, no Prometheus server. Everything embedded. Storage uses BoltDB (an embedded key-value store in Go), distributed consensus uses Raft (same library Nomad uses), service mesh uses native WireGuard.

Why? Because the biggest source of operational complexity in Kubernetes isn't the concepts; it's the 15 or so components you need to keep alive. I want someone to be able to spin up a working cluster with literally two commands: houdini server and houdini agent --server <ip>.
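
To picture the single-binary idea, here's a minimal sketch of one entry point that picks its role from a subcommand. It's written in Python purely for illustration and is not Houdini's code: the function names and the choice to make --server mandatory for the agent are assumptions, and the real arguments of deploy aren't covered.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """One entry point; the subcommand selects the role this process plays."""
    parser = argparse.ArgumentParser(prog="houdini")
    roles = parser.add_subparsers(dest="role", required=True)

    roles.add_parser("server", help="run the control plane")

    agent = roles.add_parser("agent", help="run a worker node")
    # --server appears in the post; treating it as required is an assumption.
    agent.add_argument("--server", required=True, metavar="IP")

    roles.add_parser("deploy", help="CLI client (arguments omitted in this sketch)")
    return parser


def describe(argv: list[str]) -> str:
    """Return a label for the role that the given arguments would start."""
    args = build_parser().parse_args(argv)
    if args.role == "server":
        return "control plane"
    if args.role == "agent":
        return f"worker node joining {args.server}"
    return "deploy client"


print(describe(["server"]))
print(describe(["agent", "--server", "10.0.0.1"]))
```

The point of the pattern is that there is exactly one artifact to install and upgrade; the role is a runtime choice, not a separate download.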

Principle #2: Multi-runtime

Houdini doesn't just run containers. It supports four types of workload with the same API:

Container — Docker, for when you need full isolation
Process — native OS processes, for local dev or binaries that don't need a container
WASM — WebAssembly modules, <1ms startup, lightweight sandboxing
Function — serverless, event-driven, automatic scale-to-zero

Why? Because not every workload needs a container. A Python script that runs every 5 minutes doesn't need a 200MB Docker image. A webhook handler doesn't need to be alive 24/7 consuming RAM.
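
Here's a rough sketch of what "four types of workload with the same API" could look like: a registry that maps a workload kind to a runtime driver behind one start call. Everything here is hypothetical (the Workload fields, the Runtime protocol, the stub driver); the stubs only report what they would launch, they don't launch anything.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Workload:
    name: str
    kind: str    # "container", "process", "wasm" or "function"
    target: str  # image, binary path, module file or handler (hypothetical field)


class Runtime(Protocol):
    def start(self, workload: Workload) -> str: ...


class StubRuntime:
    """Stand-in driver: reports what it would launch instead of launching it."""

    def __init__(self, label: str) -> None:
        self.label = label

    def start(self, workload: Workload) -> str:
        return f"{self.label} started {workload.name} from {workload.target}"


class RuntimeRegistry:
    """Maps a workload kind to the driver that knows how to run it."""

    def __init__(self) -> None:
        self._runtimes: dict[str, Runtime] = {}

    def register(self, kind: str, runtime: Runtime) -> None:
        self._runtimes[kind] = runtime

    def supports(self, kind: str) -> bool:
        return kind in self._runtimes

    def start(self, workload: Workload) -> str:
        runtime = self._runtimes.get(workload.kind)
        if runtime is None:
            raise ValueError(f"unsupported workload kind: {workload.kind}")
        return runtime.start(workload)


registry = RuntimeRegistry()
for kind in ("container", "process", "wasm", "function"):
    registry.register(kind, StubRuntime(kind))

print(registry.start(Workload("report", "process", "/usr/local/bin/report")))
```

The caller never branches on the workload type; adding a fifth runtime would mean registering one more driver.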

Principle #3: Networking built-in

Each node gets a WireGuard subnet (10.42.X.0/24). Containers communicate directly between nodes through encrypted tunnels. Internal DNS resolves service.namespace.houdini to the correct IPs. Ingress controller with automatic HTTPS (Let's Encrypt) included.

Why? Because in Kubernetes you need to pick a CNI plugin (Calico, Flannel, Cilium), then install an Ingress controller (NGINX, Traefik, something Envoy-based), then configure cert-manager for TLS. That's 3-4 decisions and installations before external traffic reaches your service.
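
The addressing scheme is easy to sketch with Python's standard ipaddress module. This assumes the per-node /24 blocks are carved out of 10.42.0.0/16 and that X is simply a node index; how Houdini actually assigns X, and how it hands out addresses inside a subnet, isn't stated, so node_subnet, workload_address and service_name are illustrative names only.

```python
import ipaddress

# Assumption: every node's /24 comes out of this /16, indexed by node number.
CLUSTER_NET = ipaddress.ip_network("10.42.0.0/16")
NODE_SUBNETS = list(CLUSTER_NET.subnets(new_prefix=24))  # 256 blocks


def node_subnet(node_index: int) -> ipaddress.IPv4Network:
    """Return 10.42.<node_index>.0/24 for a node."""
    if not 0 <= node_index < len(NODE_SUBNETS):
        raise ValueError(f"node index out of range: {node_index}")
    return NODE_SUBNETS[node_index]


def workload_address(node_index: int, slot: int) -> ipaddress.IPv4Address:
    """Return the slot-th usable host address on the node (slot 0 is .1)."""
    hosts = list(node_subnet(node_index).hosts())
    return hosts[slot]


def service_name(service: str, namespace: str) -> str:
    """Build the internal DNS name in the service.namespace.houdini form."""
    return f"{service}.{namespace}.houdini"


# A toy DNS table: one service backed by workloads on two different nodes.
records = {
    service_name("api", "prod"): [workload_address(3, 0), workload_address(7, 0)],
}

print(node_subnet(3))
print([str(ip) for ip in records["api.prod.houdini"]])
```

One consequence worth noticing: a /24 per node inside a /16 caps this particular layout at 256 nodes with 254 usable addresses each.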

Principle #4: Smart deploy pipeline

Deploying isn't "send the YAML and pray." It's a pipeline with phases:

Validation — does the spec make sense?
Scheduling — where to run? (bin-pack or spread, with anti-affinity, constraints, failure scoring)
Dispatch — direct push to agent via gRPC stream
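
The scheduling phase above can be sketched as a filter followed by a score. This is a simplified illustration, not Houdini's scheduler: it only looks at CPU, treats anti-affinity as a set of node names to avoid, and uses a made-up penalty of 0.1 per recent failure as the "failure scoring" term.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    cpu_total: int     # millicores
    cpu_used: int
    failures: int = 0  # recent failed placements (hypothetical penalty input)


def fits(node: Node, cpu_request: int) -> bool:
    return node.cpu_total - node.cpu_used >= cpu_request


def score(node: Node, cpu_request: int, strategy: str) -> float:
    """Higher is better. bin-pack prefers full nodes, spread prefers empty ones."""
    utilization = (node.cpu_used + cpu_request) / node.cpu_total
    if strategy == "bin-pack":
        base = utilization
    elif strategy == "spread":
        base = 1.0 - utilization
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return base - 0.1 * node.failures


def pick_node(nodes: list[Node], cpu_request: int, strategy: str = "bin-pack",
              avoid: frozenset = frozenset()) -> Optional[Node]:
    """Filter out nodes that are excluded or too full, then take the best score."""
    candidates = [n for n in nodes if n.name not in avoid and fits(n, cpu_request)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(n, cpu_request, strategy))


nodes = [Node("a", cpu_total=4000, cpu_used=3000), Node("b", cpu_total=4000, cpu_used=500)]

print(pick_node(nodes, 500, "bin-pack").name)  # fills the busier node first
print(pick_node(nodes, 500, "spread").name)    # prefers the emptier node
```

Bin-packing keeps nodes free for large workloads and makes scale-down easier; spreading trades that for a smaller blast radius when a node dies.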

Four strategies: Rolling (zero-downtime), Canary (test with % of traffic), Blue-Green (atomic switch), All-at-once (fast, accepts downtime).

If it fails? Retry with exponential backoff. Exhausted all attempts? Dead Letter Queue — nothing is silently lost.
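
A minimal sketch of that failure path, assuming a base delay of 1 second, a doubling factor, a 30-second cap and 4 attempts (none of these numbers come from Houdini). The send callable and the list used as a dead letter queue are placeholders for whatever the real dispatcher and queue are.

```python
import time
from typing import Callable


def backoff_delays(count: int, base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> list[float]:
    """Delays between attempts: base, base*factor, ... never above cap."""
    return [min(cap, base * factor ** i) for i in range(count)]


def dispatch_with_retry(job: str, send: Callable[[str], None], dead_letters: list,
                        max_attempts: int = 4,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to send a job; after the last failure, park it in dead_letters."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts - 1)
    last_error = None
    for attempt in range(max_attempts):
        try:
            send(job)
            return True
        except Exception as exc:  # a real dispatcher would catch narrower errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(delays[attempt])
    # Nothing is dropped silently: keep the job together with the reason it failed.
    dead_letters.append((job, repr(last_error)))
    return False


def unreachable_agent(job: str) -> None:
    raise ConnectionError("agent unreachable")


dead_letters: list = []
waits: list = []  # record the delays instead of actually sleeping
ok = dispatch_with_retry("web-v2", unreachable_agent, dead_letters, sleep=waits.append)

print(ok, waits, dead_letters[0][0])
```

In practice you would usually add random jitter to the delays so that many failing dispatches don't retry in lockstep.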

Principle #5: Continuous reconciliation

Every 30 seconds, a reconciler compares the desired state (what you asked for) with the actual state (what's really running). If they diverge, it corrects. Workload crashed? Restart with backoff. Node died? Reschedule to another node. Orphan container? Automatic GC.

Why? This is the pattern Kubernetes absolutely nailed — declarative + reconciliation. I copied it. I'm not ashamed of copying what works.
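
The core of a reconciler is just a diff between two views of the world. Here's a minimal sketch, assuming both states are reduced to "workload name → replica count"; the action names (start, stop, gc) are made up for the example, and a real loop would run this on a timer (every 30 seconds, per the above) and hand the actions to the scheduler.

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[tuple[str, str, int]]:
    """Return the actions needed to move the actual state toward the desired one."""
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        want = desired.get(name)
        have = actual.get(name, 0)
        if want is None:
            # Running but never asked for: an orphan to garbage-collect.
            actions.append(("gc", name, have))
        elif have < want:
            # Crashed replicas or a dead node: bring the count back up.
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    return actions


desired = {"api": 3, "worker": 1}
actual = {"api": 1, "worker": 1, "old-job": 2}

print(reconcile(desired, actual))
```

Because the function only looks at current state, running it again after the actions succeed produces an empty list; that idempotence is what makes the pattern so forgiving.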

Principle #6: Security is not optional

Secrets encrypted with AES-256-GCM (Argon2id for key derivation)
RBAC with 4 roles (admin, operator, developer, viewer)
2FA with TOTP
Agent↔server communication authenticated with a token over gRPC streams
Policy engine for admission control (blocks :latest in prod, requires health checks, etc.)
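
To make the admission-control item concrete, here's a small sketch of the two example rules mentioned in the list (no :latest in prod, health checks required). The Spec fields and rule wording are assumptions, and the tag parsing follows Docker's convention that an image reference without a tag means latest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Spec:
    name: str
    image: str
    environment: str                    # e.g. "prod" or "dev" (hypothetical field)
    health_check: Optional[str] = None  # e.g. an HTTP path; None means not set


def uses_latest_tag(image: str) -> bool:
    """True when the image resolves to :latest, explicitly or by omission."""
    last_segment = image.rsplit("/", 1)[-1]  # skip any registry host:port part
    if "@" in last_segment:                  # pinned by digest
        return False
    _, separator, tag = last_segment.partition(":")
    return not separator or tag == "latest"


def admit(spec: Spec) -> list[str]:
    """Return the list of policy violations; an empty list means admitted."""
    violations = []
    if spec.environment == "prod" and uses_latest_tag(spec.image):
        violations.append("prod workloads must not use the :latest tag")
    if spec.health_check is None:
        violations.append("a health check is required")
    return violations


risky = Spec("api", "registry.local:5000/api", "prod")
safe = Spec("api", "registry.local:5000/api:1.4.2", "prod", health_check="/healthz")

print(admit(risky))
print(admit(safe))
```

Returning every violation at once, instead of failing on the first, saves the user a round trip per mistake.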

Simplified overview

┌─────────── CONTROL PLANE (houdini server) ──────────────┐
│                                                          │
│   REST API (:4646)  │  gRPC (:4648)  │  Ingress (:443)   │
│                                                          │
│   Scheduler │ Reconciler │ Deploy Pipeline │ Autoscaler  │
│                                                          │
│          State Store (BoltDB / Raft cluster)             │
└──────────────────────────┬───────────────────────────────┘
                           │ gRPC Streams + WireGuard Mesh
                           ▼
┌──────── WORKER NODE (houdini agent) ─────────┐
│                                               │
│   Runtime Registry                            │
│   ├── Docker (containers)                     │
│   ├── Process (native)                        │
│   ├── WASM (modules)                          │
│   └── Function (serverless)                   │
│                                               │
│   Workload Manager │ Health Probes │ Logs     │
└───────────────────────────────────────────────┘

But what about the user experience?

I know what you're thinking: "Ok, lots of technical stuff under the hood, but what about in practice?" Good question.

The point is that all this complexity — scheduling, mesh, autoscaling, reconciliation — exists so that the user doesn't have to think about it. All of this will be managed transparently through a Web UI where you configure, deploy, monitor, and scale your services without ever opening a terminal if you don't want to.

The idea is simple: if you want to go deep with TOML and the CLI, go ahead. If you want to click "Deploy" and go grab a coffee, you can do that too. No judgment.

The Web UI will be the topic of a future article — and I promise that one will be way less technical than this one. Actually, this is probably the most technical article I'll write in this series. If you survived this far, the next ones will be a breeze.

Next steps

The project isn't public yet, but it will be soon. I'm preparing everything so that when I open the repository, people can actually clone, run, and test it — I don't want to open something half-baked and give the wrong first impression.

In the next post I'll talk about the Web UI and the experience I want to deliver for those who don't want (or need) to live in the terminal.

Until then, if you have questions, criticism, or suggestions — send them. Especially if you disagree with any decision. That's how projects improve.

Top comments (1)

Sephyi • May 31

I had a reply half-written, then saw the comment pointing here. Reading through the refactored text, I hit "Docker, for when you need full isolation" and it lodged in my brain like a splinter.

What do you mean by "full isolation"? It's Docker. Containers don't really give you that in the general case. I'd have reached for Podman personally. You're also overcomplicating the writing in places: you could just say "meshed WireGuard".

In my view the whole point of containers is a clean default base, so why turn it into rocket science instead of just putting everything in a container? You won't reclaim any resources worth caring about by being clever here, and you take on extra security risk and a pile of headaches for nothing. If I approached it differently, I might drop Docker entirely and run everything under systemd. That loses Windows, but that's a separate story. Though to be fair, if I remember right, Podman drops out on that front too, since it doesn't support Windows-based containers. And it probably wouldn't buy you the slightest benefit anyway. I haven't benchmarked it, but technically containers are very light: basically namespaces, cgroups, and a lot of glue running on the host Linux kernel.

On learning from how others did it: I'm not sure that's a good idea, at least not at the start. It biases you toward what you've already seen. Did you maybe do it later in the process instead? Either way, I think it ultimately doesn't matter much how others approached it; I'd focus on my own use case first. Kubernetes isn't a real reference point here anyway. It's built for a problem most people don't actually have.

I'd been chewing on this myself since first reading the post a few days ago, and here's a stack I came up with in theory:

Headscale — network mesh
CoreDNS — name resolution
Caddy — reverse proxy, ingress, Let's Encrypt
Podman — containers
A Rust control plane that orchestrates and talks to the cloud provider / Proxmox API for elasticity.

Two things I'd push back on:

Why a private repo? Open it up. Someone might read the code and contribute something useful, or catch a design flaw early before it becomes catastrophic down the line. If you're shy about the code, I get it. But then why announce this so early? You lose all the momentum: nobody's going to bookmark this and check back every few months to see if anything shipped. Except maybe me, since I didn't have time to comment when it first went up.

Why build this at all? Almost everything I make comes from being unhappy with an existing tool, or because the thing just doesn't exist yet. What's your itch? Are you running workloads that genuinely need autoscaling? For me, as an example: I hate writing commit messages but want everything to be perfect, so I built an AI commit generator that actually doesn't suck.