I was told that if you run a free AI image platform on consumer hardware, you'll either (a) go bankrupt or (b) go down. We crossed 52,000 users last week and we are, to my surprise, still up and still broke on purpose. Here are the load-bearing decisions in the architecture, the things that broke under real traffic, and the things I was sure would break and didn't.
Context: zsky.ai is an AI image and video tool that costs zero dollars to use. Not freemium — free. There's a subscription tier to support development, but the core generator is open to the public without an account. We run on seven consumer GPUs in my house, plus a thin Supabase + Cloudflare edge layer.
I built it because I have aphantasia. I literally cannot see an image in my head — even my own mother's face is a feeling, not a picture. AI generation is the first technology that let me iterate visually on my own ideas without needing another person to translate for me. When I kept losing access to the hosted tools I depended on (cf. the Sora shutdown), I decided to run the infrastructure myself. This post is what I've learned from that choice meeting real traffic.
The stack, in one paragraph
Seven desktop-class GPUs spread across five machines on a 2.5GbE local network. An orchestrator/dispatcher on a CPU-heavy box that queues jobs and routes them to the least-loaded worker. Supabase for auth + Postgres + storage. Cloudflare for edge caching, DDoS, and the CDN. Nginx on the orchestrator for TLS termination and routing. Everything is commodity hardware from 2022-2024 — no datacenter, no hyperscaler bill.
The numbers, in one table
| Metric | Value |
|---|---|
| Cumulative users | 52,260 |
| Peak concurrent renders | 41 |
| Daily generation volume | 18,000-26,000 |
| Uptime last 30 days | 99.4% (two incidents, both my fault) |
| Total hardware cost | ~$22k (five machines + GPUs, amortized over three years) |
| Monthly power cost | ~$340 at my Florida utility rate |
| Paid advertising spend, lifetime | $0 |
We get one question about this table more than any other: "Why would you do this instead of just using a cloud GPU provider?"
The honest answer is that I don't trust the unit economics of the cloud GPU market at the consumer price point I want to serve. Free inference at cloud-GPU prices is a very fast path to an acquired-and-sunsetted product. I'd rather own the metal and control the cost floor. More on that philosophy on the free-image-gen page, but the short version is: if the electricity in the house can power the users, the users get it free.
What broke
1. The queue was the first thing
Our initial dispatcher was a naive round-robin. Worker 1 gets job 1, worker 2 gets job 2, and so on. This works until the jobs have wildly different costs, and ours do — a small 768px render is roughly 3 seconds of GPU time, and an 8-second video render is 180+ seconds. Round-robin would send a video job to a worker that was already mid-video while a peer worker sat idle on an image. Tail latency was awful.
Fix: weighted least-cost queueing, where we estimate job cost from the input params (resolution, duration, refiner toggle) and always dispatch to the worker whose current projected completion time is lowest. This single change dropped p95 latency from 34 seconds to 11 seconds on the same hardware.
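The idea fits in a few lines: estimate a job's GPU-seconds from its params, keep a min-heap of each worker's projected busy time, and always pop the least-loaded worker. Here's a minimal sketch — the cost coefficients are illustrative, not our production calibration:

```python
import heapq

def estimate_cost(resolution: int, duration_s: float = 0.0, refiner: bool = False) -> float:
    """Rough GPU-seconds for a job. Coefficients are made up for illustration."""
    base = 3.0 * (resolution / 768) ** 2   # images scale roughly with pixel count
    video = 22.0 * duration_s              # video dominates: ~180s for an 8s clip
    return base + video + (4.0 if refiner else 0.0)

class Dispatcher:
    """Route each job to the worker with the lowest projected completion time."""
    def __init__(self, worker_ids):
        # heap of (projected_busy_seconds, worker_id)
        self.heap = [(0.0, w) for w in worker_ids]
        heapq.heapify(self.heap)

    def dispatch(self, cost: float) -> str:
        busy, worker = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (busy + cost, worker))
        return worker
```

The key difference from round-robin: after a worker takes a 180-second video job, it sits deep in the heap and won't be handed another job until its projected backlog drops below everyone else's.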
This is one of those cases where you're taught something general in a distributed-systems class (cost-based scheduling beats round-robin when jobs are heterogeneous), you nod at it, and then years later you hit it in production and go *oh, that's what they meant*.
2. Cloudflare's default cache is too aggressive
I spent half a day chasing a bug where deployed CSS changes wouldn't appear for some users. I finally realized Cloudflare was caching the HTML with our CSS reference for up to four hours at some edge PoPs. We'd update the CSS, users in our office would see the new version instantly (because our PoP was refreshed), and users in other regions would see yesterday's layout.
Fix: cache-bust every CSS/JS reference with a build hash and set `Cache-Control: no-cache` on the HTML itself. I added this to my personal "check every single time" list after losing a full day to it. Life lesson: the CDN is not your friend, it is your frenemy.
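The mechanics are simple enough to sketch. The helper names below are illustrative (any build step that fingerprints assets works): the hashed asset URL changes whenever its content does, so the edge can cache it forever, while the HTML that references it forces revalidation:

```python
import hashlib
from pathlib import Path

def asset_url(path: str, content: bytes) -> str:
    """Fingerprint an asset: app.css -> app.<hash>.css.
    The URL changes whenever the file's bytes change, so stale edge
    copies of the old URL simply stop being referenced."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    p = Path(path)
    return f"{p.stem}.{digest}{p.suffix}"

def html_cache_headers() -> dict:
    """Headers for the HTML response itself: no-cache means the edge and
    browser must revalidate before serving, so a fresh deploy's asset
    references propagate immediately."""
    return {"Cache-Control": "no-cache"}
```

Note that `no-cache` does not mean "don't cache" — it means "revalidate before serving," which is exactly what you want for a tiny HTML shell pointing at immutable hashed assets.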
3. Supabase Row Level Security vs. high-cardinality reads
Our feed page originally did a full `select * from generations where is_public = true order by created_at desc limit 50` with RLS turned on. Works great at 500 generations, works fine at 5,000, and chokes somewhere around 200,000, when every RLS policy has to be evaluated against every candidate row to determine visibility.
Fix: a materialized view that snapshots the public feed every 60 seconds, served to unauthenticated users via the anon key with RLS disabled on the view. Signed-in users hit the live table with RLS on. The public endpoint's p95 went from 1.9 seconds to 80 ms. The lesson: RLS is correct for writes, cached views are correct for reads that don't need perfect freshness.
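The pattern is worth spelling out. The SQL below is illustrative (table and column names are placeholders, not our exact schema), and the `FeedCache` class is an in-memory stand-in for the materialized view so the snapshot-vs-live split is visible in one place:

```python
import time

# Illustrative Postgres side of the pattern. Refreshed on a timer with:
#   REFRESH MATERIALIZED VIEW CONCURRENTLY public_feed;
PUBLIC_FEED_SQL = """
CREATE MATERIALIZED VIEW public_feed AS
SELECT id, image_url, created_at
FROM generations
WHERE is_public = true
ORDER BY created_at DESC
LIMIT 50;
"""

class FeedCache:
    """Serve anonymous readers a snapshot no older than `ttl` seconds.
    `fetch` is the expensive query; it only runs when the snapshot is stale."""
    def __init__(self, fetch, ttl: float = 60.0, clock=time.monotonic):
        self.fetch, self.ttl, self.clock = fetch, ttl, clock
        self.snapshot, self.stamp = None, -float("inf")

    def get(self):
        now = self.clock()
        if now - self.stamp >= self.ttl:   # stale: re-run the expensive query
            self.snapshot, self.stamp = self.fetch(), now
        return self.snapshot
```

Signed-in users bypass this entirely and hit the live table with RLS on, because their view of the data is personalized and needs to be fresh.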
4. GPU thermal throttling is real and silent
We had one worker in a poorly-ventilated case that would hit 84°C and silently clock down to 60% of peak throughput. Nothing crashed. Nothing logged. Generations on that worker just took longer, and we got sporadic complaints about "slow renders."
Fix: exported nvidia-smi metrics to Prometheus every 15 seconds and set alerts on sustained temps over 78°C. Also replaced the case with a mesh-front one. Obvious in retrospect, completely invisible until I went looking.
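The monitoring loop is small. This sketch separates the `nvidia-smi` parsing from the alert logic; the 78°C threshold is the one from the fix above, but the sustained-window length is an assumption (our real alert rule lives in Prometheus, not application code):

```python
import subprocess
from collections import deque

def parse_temps(smi_output: str) -> list:
    """Parse the one-temperature-per-line output of:
    nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits"""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]

class ThermalAlert:
    """Fire only when temp stays over the limit for `window` consecutive
    samples. At one sample per 15s, window=8 means two sustained minutes,
    which filters out harmless transient spikes."""
    def __init__(self, limit: int = 78, window: int = 8):
        self.limit, self.samples = limit, deque(maxlen=window)

    def observe(self, temp: int) -> bool:
        self.samples.append(temp)
        full = len(self.samples) == self.samples.maxlen
        return full and all(t > self.limit for t in self.samples)

def read_gpu_temps() -> list:
    """Requires nvidia-smi on PATH; one temperature per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_temps(out)
```

The "sustained" part matters: a GPU briefly touching 80°C during a burst is fine; a GPU parked above 78°C for minutes is the silent-throttling failure mode described above.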
5. Anonymous abuse
Letting people generate without an account is a core product value. It also means bad actors can fire a botnet at your generate endpoint and burn through your GPU-hours. Our first defense (per-IP rate limits) was trivially bypassed with a residential proxy network. Our second defense (a CAPTCHA on the first request of a session) caused a 12% spike in abandonment among signed-out users.
Fix: a layered approach — Cloudflare's bot score for the first check, behavioral signals (mouse entropy, time-on-page before first submit), and a soft gate that only escalates to a CAPTCHA when the request pattern looks automated. We lost about 2% of legitimate anonymous traffic. We blocked roughly 190,000 bot generations in March alone.
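The decision logic looks roughly like this. The signal names and thresholds below are illustrative, not our production values — the one real convention is Cloudflare's bot score, which runs 1 (definitely a bot) to 99 (definitely human):

```python
def gate_decision(cf_bot_score: int, mouse_entropy: float,
                  seconds_on_page: float) -> str:
    """Return 'allow', 'captcha', or 'block'.
    cf_bot_score: Cloudflare's 1-99 scale, where 1 = definitely a bot.
    Thresholds are illustrative placeholders."""
    if cf_bot_score < 10:
        return "block"                 # near-certain automation
    suspicious = 0
    if cf_bot_score < 30:
        suspicious += 1                # Cloudflare thinks it's likely a bot
    if mouse_entropy < 0.2:
        suspicious += 1                # no organic pointer movement before submit
    if seconds_on_page < 1.5:
        suspicious += 1                # submitted faster than a human reads
    return "captcha" if suspicious >= 2 else "allow"
```

The point of requiring two independent suspicious signals before escalating is exactly the abandonment problem above: any single signal misfires on legitimate users (touchscreens have no mouse entropy, power users submit fast), but two at once rarely does.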
What I was sure would break, and didn't
The home internet
Our upload is 40 Mbps. I was convinced that at 50 concurrent requests we'd saturate it serving image results. It turns out that (a) most generated images are 300-800 KB, (b) Cloudflare's CDN eats the bulk of repeat views, and (c) most users navigate away immediately after seeing their result. At peak, we've used about 18 Mbps sustained. This was the most pleasant surprise of the project.
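The back-of-envelope is worth writing down, because it shows why the fear was misplaced. Assuming a ~550 KB average result and a tolerable ~10-second delivery window (both assumptions on my part, picked to match the observed peak):

```python
def mbps_needed(concurrent: int, kb_each: int, window_s: float) -> float:
    """Megabits/s of upload needed to deliver `concurrent` results of
    kb_each KB each within window_s seconds. (1 KB = 8 kilobits.)"""
    return concurrent * kb_each * 8 / 1000 / window_s
```

At 41 concurrent renders averaging 550 KB delivered over 10 seconds, that's about 18 Mbps — well under the 40 Mbps pipe, and consistent with what we actually observed.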
The dispatcher
I was sure the single dispatcher node would become a bottleneck and I'd have to shard it. It hasn't. A plain Python FastAPI process on an older workstation-class CPU routes 20k+ requests a day and sits at about 4% CPU. It turns out routing a job to one of seven workers is not a hard problem unless you make it one.
Electrical
Everyone told me I'd melt the house. I drew up the load spreadsheet in fear before we turned on the fourth GPU. Peak household draw under full AI load is 4.1kW. My HVAC alone pulls 3.2kW. We have not tripped a single breaker. The panel upgrade we did two years ago for electric-car charging saved us here.
The philosophical part (because you can't talk about free AI for long without getting to it)
The only reason this is affordable is because we've constrained the problem. We are not trying to serve video in ten languages at 4K. We are trying to serve the ninety-percent case of creative AI — an image, a short clip, something good enough to iterate on — at a cost point that makes it genuinely free for the user. Once you accept that constraint, consumer hardware is not only viable, it's the best fit, because it lines the cost curve up with the price we're charging (zero).
I think a lot of the AI industry's pricing pressure right now comes from trying to serve one hundred percent of use cases on infrastructure that only makes sense for the top ten percent. If you accept being a ninety-percent-case tool, the math relaxes dramatically.
This is not a knock on GPT-5 or Sora or Midjourney. They are serving different constraints, and they're remarkable. It's just to say: there is room for another model, where the floor is free and the ceiling is "good enough," and that's the bet we've made.
Try it
Anonymous render, no signup: zsky.ai/create. If you're self-hosting and want to compare notes on dispatcher scheduling or the RLS-vs-materialized-view tradeoff, I'm at hello@zsky.ai and I read everything. If you're a creator looking for a free AI image generator that isn't going to pivot to $99/month next quarter, you're in the right place.
I'm Cemhan Biricik. I shoot for Vogue, won two National Geographic awards, and have aphantasia. I build AI tools because when I recovered from a TBI in 2014, photography was how I learned to see again — and I want that access for everyone, regardless of budget.