SayThat.sh runs on three Intel i5-10500T mini PCs sitting under my desk. No AWS. No Vercel. No monthly cloud bill. The entire production infrastructure - database, cache, API, frontend, monitoring, analytics, CI/CD - runs on hardware I own, for about $11 per month in electricity.
This is how it's set up and why I went this route.
The hardware
| Node | CPU | RAM | Storage | Role |
|---|---|---|---|---|
| st-cp-01 | i5-10500T (6c/12t) | 24 GB | 2TB NVMe SSD | Control plane + builds |
| st-cp-02 | i5-10500T (6c/12t) | 16 GB | 256GB NVMe SSD | Control plane |
| st-cp-03 | i5-10500T (6c/12t) | 16 GB | 256GB NVMe SSD | Control plane |
Three machines, 18 cores (36 threads), 56 GB RAM total. Each one cost somewhere between $120 and $250 used. All three are k3s control-plane nodes running embedded etcd for HA consensus. There is no separate worker tier - every node participates in both cluster management and workload scheduling.
The software stack
Orchestration: k3s with embedded etcd - any node can go down and the cluster keeps running.
Database: CloudNativePG operator managing a 2-instance PostgreSQL 15 cluster. Streaming replication, automatic failover in 5 to 30 seconds, built-in PgBouncer pooling, and native Prometheus metrics with no separate exporter.
Cache: Redis Sentinel - 3 Redis instances and 3 Sentinel instances as StatefulSets. Automatic master election in about 5 seconds, transparent to the application via ioredis Sentinel-aware connections.
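The failover timing above comes down to a handful of Sentinel directives. A minimal sentinel.conf sketch - the master name, hostname, and quorum here are illustrative, not my exact values:

```conf
# Watch a master named "mymaster"; a quorum of 2 Sentinels must agree it is down
sentinel monitor mymaster redis-0.redis-headless.svc 6379 2
# Declare the master down after 5s of silence (matches the ~5s failover above)
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1
# Needed when monitoring by Kubernetes DNS name rather than IP
sentinel resolve-hostnames yes
```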
Storage: Longhorn for distributed block storage replicated across nodes.
Networking: Cloudflare Tunnel (2 replicas) for ingress. No port forwarding, no public IPs. Traffic goes: user to Cloudflare edge, through an encrypted tunnel, directly to cluster services. No Traefik, no Nginx in the path.
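cloudflared routes by hostname inside the tunnel, so "no ingress controller" comes down to one config file. A sketch - hostnames, namespaces, and ports here are placeholders, not the real ones:

```yaml
# cloudflared config.yml sketch: each hostname maps straight to a ClusterIP service
tunnel: <tunnel-id>
credentials-file: /etc/cloudflared/creds/credentials.json
ingress:
  - hostname: saythat.sh
    service: http://frontend.default.svc.cluster.local:3000
  - hostname: api.saythat.sh
    service: http://api.default.svc.cluster.local:8080
  # Catch-all: anything else gets a 404 at the edge
  - service: http_status:404
```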
Monitoring: VictoriaMetrics (52 alert rules), Grafana (16 auto-provisioned dashboards), VictoriaLogs and Grafana Alloy for log aggregation, Alertmanager for routing to email and Telegram. Dashboards cover API performance, business metrics, security events, audit logs, SLO tracking, PostgreSQL internals, Redis stats, Longhorn storage, n8n automation, and more.
Deploy: Direct deploy via a build script that builds Docker images, pushes to the in-cluster registry, and triggers rolling updates via kubectl set image. No GitOps intermediary - I want to see the build output and control exactly when things ship.
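Stripped to its core, that flow is a short loop of build, push, and rolling update. A sketch under assumptions - the registry address, directory layout, and deployment names are illustrative, not the actual script:

```shell
#!/usr/bin/env bash
# Sketch of the build-and-ship flow: build with BuildKit, push to the
# in-cluster registry, trigger a rolling update. Names are hypothetical.
set -euo pipefail

REGISTRY="${REGISTRY:-registry.default.svc:5000}"

deploy() {
  local app="$1" tag="$2"
  local image="${REGISTRY}/${app}:${tag}"

  # Build and push in one step (assumes buildx is the default builder)
  docker build --push -t "${image}" "./${app}"

  # Roll the deployment to the new image and wait for it to settle
  kubectl set image "deploy/${app}" "${app}=${image}"
  kubectl rollout status "deploy/${app}" --timeout=120s
}

# Example invocation: tag every build uniquely, e.g. with the git SHA
# deploy api "$(git rev-parse --short HEAD)"
```

Tagging each build with the commit SHA is what makes `kubectl set image` a real rolling update rather than a no-op on a reused tag.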
Builds: Docker BuildKit with registry cache. Frontend builds in about 35 seconds from cache, API in about 40 seconds.
Automation: Ansible with 15 roles and 14 playbooks covering cluster bootstrap, deployment, image builds, monitoring setup, and pre-maintenance sequences.
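To make that concrete, here is the shape of one pre-maintenance task - the host and options are illustrative, and the real playbooks also handle Longhorn volume detachment first:

```yaml
# Illustrative slice of a pre-maintenance playbook: cordon and drain a node
- hosts: st_cp_01
  become: true
  tasks:
    - name: Drain the node before maintenance
      ansible.builtin.command: >
        kubectl drain {{ inventory_hostname }}
        --ignore-daemonsets --delete-emptydir-data --timeout=120s
      delegate_to: localhost
```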
Why self-host
The honest answer is that I like running infrastructure. But there are practical reasons too.
Cost. The equivalent setup on AWS - three EC2 instances, RDS with a read replica, ElastiCache, S3, CloudWatch, an Application Load Balancer - would run $200 to $400 per month. I paid a one-time cost for hardware and spend $11 per month on electricity. Even accounting for hardware replacement every few years, the math works out heavily in favor of self-hosting at this scale.
Control. I can SSH into any node, inspect any container, read any log, modify any config without going through a cloud console. There is no abstraction layer between me and what is actually happening. When something breaks, I am looking directly at it.
Learning. Running your own Kubernetes cluster teaches you things that managed services deliberately hide. Networking, distributed storage, failover mechanics, security isolation - you deal with all of it directly. That knowledge transfers.
Performance. NVMe drives on a local network are faster than most cloud block storage. Sub-millisecond latency between nodes. No noisy neighbor problem.
High availability - actually tested
This is not just HA on paper. I have tested full node failures. Here is what happens when I pull the plug on a node:
- k3s detects the node is unavailable
- CloudNativePG promotes the PostgreSQL streaming replica to primary (5 to 30 seconds depending on timing)
- Redis Sentinel elects a new master (approximately 5 seconds)
- Kubernetes reschedules pods to surviving nodes
- Cloudflare Tunnel keeps routing through the remaining replica
- When the node comes back online, it auto-uncordons and rejoins without any manual intervention
Zero manual steps. I have done this multiple times and confirmed the site stays up throughout.
What makes this actually work in practice:
- HPA scale-down guardrails - stabilization windows and max-scale-down policies prevent Kubernetes from reducing pod counts too aggressively during a node drain, which would cause unnecessary downtime
- Pod anti-affinity - critical services spread across nodes so a single node failure does not take out both replicas of anything
- PodDisruptionBudgets on the API, frontend, and Cloudflare Tunnel - minAvailable: 1 ensures at least one replica survives any disruption
- Graceful drain automation - a systemd ExecStartPost hook triggers automatic node uncordon after k3s starts back up; a corresponding drain script handles clean Longhorn volume detachment before stopping
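These guardrails are ordinary Kubernetes objects. A condensed sketch of the first three - names, labels, and thresholds are illustrative, not my exact manifests:

```yaml
# Keep at least one API replica alive through any voluntary disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: api
---
# Pod template fragment: spread API replicas across nodes
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: api
        topologyKey: kubernetes.io/hostname
---
# HPA fragment: wait 5 minutes, then drop at most one pod per minute
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Pods
        value: 1
        periodSeconds: 60
```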
The three-tier DR backup
Disaster recovery is also tested. The backup strategy has three tiers:
Tier 1: Continuous PostgreSQL WAL shipping. The Barman Cloud Plugin for CloudNativePG ships WAL files continuously to MinIO on a separate machine on the LAN (RPO under 5 minutes). Daily base backups run via ScheduledBackup CRs.
Tier 2: Daily Restic snapshots. A CronJob encrypts and ships Kubernetes secrets and pg_dumps to MinIO daily. Recovery point is 24 hours.
Tier 3: Longhorn volume snapshots. Longhorn's native S3 integration ships daily volume snapshots to MinIO for non-database persistent volumes.
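Tier 1's daily base backup is a single CR. A sketch, assuming a cluster named pg-cluster and the Barman Cloud Plugin registered under the name its docs use - note that CNPG schedules use a six-field cron spec with a seconds field:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: daily-base-backup
spec:
  schedule: "0 0 3 * * *"   # six-field cron: daily at 03:00
  cluster:
    name: pg-cluster
  method: plugin
  pluginConfiguration:
    name: barman-cloud.cloudnative-pg.io
```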
All three tiers target a separate physical machine. If the entire k3s cluster goes down, I can follow the DR runbook and restore the site from cold, with no manual data reconstruction.
Prometheus alerts fire on stale WAL archiving, stale base backups, stale Restic snapshots, and stale Longhorn backups - so backup failures are caught before they matter.
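In vmalert/Prometheus rule format, the WAL-freshness check looks roughly like this - the metric name is an assumption about CNPG's built-in exporter, so verify it against what your instance actually publishes:

```yaml
groups:
  - name: backup-freshness
    rules:
      - alert: WALArchivingStale
        # Metric name is an assumption - check your CNPG exporter's output
        expr: cnpg_pg_stat_archiver_seconds_since_last_archival > 600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "WAL archiving stalled on {{ $labels.pod }} for over 10 minutes"
```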
Security without cloud security theater
Running your own infrastructure means you own the security posture too.
18 Kubernetes NetworkPolicy objects restrict ingress and egress per service. The database only accepts connections from the API, worker, PgBouncer, monitoring, and backups. Redis only accepts connections from the API, worker, Sentinel instances, and the Redis exporter. These are enforced by k3s's embedded kube-router network policy controller (k3s 1.34) - not aspirational YAML.
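As one example, the Postgres policy boils down to something like this - the app labels are illustrative, while `cnpg.io/cluster` is the label the operator puts on its pods:

```yaml
# Only explicitly listed workloads may reach PostgreSQL on 5432
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-ingress
spec:
  podSelector:
    matchLabels:
      cnpg.io/cluster: pg-cluster
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
        - podSelector:
            matchLabels:
              app: worker
      ports:
        - port: 5432
```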
Every pod runs as a non-root user with dropped Linux capabilities. The Cloudflare Tunnel model means there are no public-facing ports at all - the home network is invisible from the internet.
On the application side: rate limiting at three levels (3 req/sec burst, 20 req/10sec, 100 req/min) with per-endpoint overrides. Cloudflare Turnstile CAPTCHA and honeypot fields on public forms. CASL-based role permissions. JWT in httpOnly cookies with refresh token rotation. Dual-method 2FA. Audit logging with GeoIP enrichment for every significant action.
What I would do differently
Start with 2 nodes. I started with a single machine and retrofitted HA later. Two nodes from day one would have saved meaningful rework.
Longhorn needs careful handling. Volumes must be cleanly detached before stopping a node - unclean detachment causes ext4 corruption. The multipathd daemon will claim Longhorn's iSCSI block devices unless you blacklist them in /etc/multipath.conf. PostgreSQL needs a subdirectory inside a Longhorn volume because the ext4 root contains lost+found, which PostgreSQL rejects. Solvable problems, but they cost time.
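The multipathd fix is two lines of config - this is the blacklist Longhorn's knowledge base recommends; apply it on every node and restart multipathd:

```conf
# /etc/multipath.conf - keep multipathd off Longhorn's iSCSI block devices
blacklist {
    devnode "^sd[a-z0-9]+"
}
```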
Ansible from day one. Shell scripts work, but Ansible playbooks are idempotent and self-documenting. Migrating later was unnecessary friction.
Watch the BuildKit registry cache. After a cached build, changing source code can still serve the old compiled TypeScript layer. Rebuild with --no-cache when something looks off. And k3s's containerd cache ignores imagePullPolicy: Always - the reliable fix is pinning deployments to the exact SHA256 digest.
The numbers
Current cluster utilization sits at about 22% CPU and 25% memory. The platform can handle 10,000 or more daily active users without any infrastructure changes.
| DAU | Monthly Cost | What Changes |
|---|---|---|
| up to 10K | $11 | Nothing |
| 10K to 50K | $31 | Enable Cloudflare edge caching |
| 50K to 100K | $31 to $42 | Add 1 to 2 mini PCs |
| 100K to 500K | $42 to $73 | 6 to 9 nodes total |
The single most impactful optimization that hasn't been made yet is enabling Cloudflare edge caching on the homepage and read-only API endpoints. That alone would reduce origin load by 70 to 80% at zero additional infrastructure cost.
Is this for everyone?
No. If you want to ship fast and not think about infrastructure, use a managed platform. If your time is worth more than the cloud bill you'd pay, use a managed platform.
But if you're a solo developer who enjoys the full stack, if you want to understand how distributed systems actually work rather than just using abstractions over them, and if you'd rather own your infrastructure than rent it indefinitely - self-hosting on mini PCs is a real option. The hardware is cheap, the software is open source, and the electricity costs less than a basic Heroku dyno.
The site is at saythat.sh. Happy to answer questions about any part of the setup - the Longhorn quirks, the CNPG configuration, the Cloudflare Tunnel routing, the DR backup architecture, any of it.
What would you do differently if you were building this?