Biricik Biricik

Posted on • Originally published at zsky.ai

Hosting an AI Image Generator on 7 Consumer GPUs in My Living Room: Architecture Deep-Dive

When people hear that zsky.ai runs on seven consumer GPUs in my living room, the usual reaction is a mix of disbelief and "that cannot possibly be stable." It is stable. It serves tens of thousands of users a day, generates both images and video, and the whole thing sits behind a single public endpoint. This post is the architecture walkthrough I wish someone had written when I was starting.

Why consumer GPUs at all

I'm a photographer with aphantasia — I cannot visualize images in my head. When I recovered from a TBI, the camera became my way of seeing. When generative AI arrived, it became an extension of that same instinct. I wanted to give everyone access to it without charging them rent-seeking prices, which meant I had to own the metal. Cloud H100s at list price would have killed the unit economics on day one.

Seven RTX 5090s, on the other hand, give me enough aggregate VRAM (224 GB) and enough raw FP8/FP16 throughput to fan out work across models, at a capex that pays back in weeks if the platform works. So I bet on prosumer hardware and a smart dispatcher.

The physical layout

  • 1 head node (CPU, no GPU) — runs nginx, the API, the queue, Postgres connection pool, and auth.
  • 7 worker nodes — each hosting one or more GPUs. Mixed: some are single-GPU desktops on the LAN, some are dual-GPU boxes.
  • All nodes on a 2.5 GbE switch with jumbo frames. Tailscale overlay for anything that crosses the NAT boundary.
  • A fan-out storage layer for model weights — each worker preloads its assigned models at boot so cold start is only paid once.

The head node is intentionally boring. It has one job: accept requests, authenticate them, push them onto the right queue, and stream results back.

The dispatcher

The core abstraction is a "capability tag" per worker. Every worker registers itself on boot with something like:

{
  "worker_id": "gpu03",
  "host": "10.0.0.13",
  "capabilities": ["image.fast", "image.hq", "upscale"],
  "vram_gb": 32,
  "concurrency": 2,
  "warm_models": ["image-v2", "image-hq"]
}

The dispatcher keeps this registry in Redis with a 15-second TTL — every worker heartbeats every 5 seconds. If a worker goes silent (game launches on one of the dual-use boxes, driver hiccup, whatever), it drops off the routing table automatically.
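The TTL semantics are easy to model. Here is a minimal in-memory sketch of what the registry guarantees — the production version is just Redis keys with `EXPIRE`, not this class, so treat it as an illustration of the behavior rather than the real code:

```python
import time


class Registry:
    """In-memory stand-in for the Redis registry: each heartbeat
    refreshes a 15-second TTL, so a worker that goes silent drops
    off the routing table automatically."""

    TTL = 15.0  # seconds; workers heartbeat every 5

    def __init__(self):
        self._workers = {}  # worker_id -> (info, expires_at)

    def heartbeat(self, info, now=None):
        now = time.monotonic() if now is None else now
        self._workers[info["worker_id"]] = (info, now + self.TTL)

    def all(self, now=None):
        """Only live (non-expired) workers are visible to routing."""
        now = time.monotonic() if now is None else now
        return [info for info, exp in self._workers.values() if exp > now]
```

Three missed heartbeats and the worker simply stops being a routing candidate — no explicit deregistration path needed.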

Routing logic is plain Python, because it was plain Python two years ago and it still works:

def pick_worker(job):
    candidates = [
        w for w in registry.all()
        if job.capability in w["capabilities"]
        and w["inflight"] < w["concurrency"]
        and job.model in w["warm_models"]
    ]
    if not candidates:
        # fall back to any worker with the capability, even if cold
        candidates = [
            w for w in registry.all()
            if job.capability in w["capabilities"]
            and w["inflight"] < w["concurrency"]
        ]
    if not candidates:
        return None  # queue it
    # prefer the worker with the lowest (inflight / concurrency) ratio
    return min(candidates, key=lambda w: w["inflight"] / w["concurrency"])

Two things matter here:

  1. Warm-model preference. A request that lands on a worker whose model is already in VRAM starts generating in under a second. A request that lands on a cold worker pays a 6-12 second load penalty. So the dispatcher treats warmth as a first-class routing feature, not an afterthought.
  2. Load ratio, not raw load. Workers have different concurrency limits based on VRAM headroom and model size. Comparing raw inflight counts punishes the beefier boxes. Ratios normalize it.

The queue

Every job that cannot be placed immediately goes into a Redis Stream, partitioned by capability. A small pool of async workers on the head node pulls from streams and re-runs pick_worker every few hundred milliseconds as workers free up. Pseudocode:

async def queue_worker(capability):
    stream = f"q:{capability}"
    while True:
        # Read from the start of the stream each tick: dispatched jobs
        # are XDEL'd, so anything still present is genuinely waiting.
        # (Reading from "$" would only surface brand-new entries and
        # would never retry jobs left behind on a previous tick.)
        entries = await redis.xread({stream: "0"}, count=100)
        for _, msgs in entries:
            for msg_id, fields in msgs:
                job = Job.from_stream(fields)
                worker = pick_worker(job)
                if worker:
                    await dispatch(worker, job)
                    await redis.xdel(stream, msg_id)
                # else: leave it in the stream, try again next tick
        await asyncio.sleep(0.5)

There are three knobs I tune:

  • Capability fan-out. Fast image jobs have 4 workers. HQ image jobs have 2. Video jobs have all 7 when the platform is quiet and 3 when it is busy — video is long-running, so I cap its share to keep image latency bounded.
  • Priority lanes. Paid-tier jobs go into a separate stream and the dispatcher drains it first. The free tier is still fast (usually under 5 seconds) because the capex is low enough to leave headroom.
  • Backpressure. When a queue's depth exceeds a threshold, the API returns a "try again in N seconds" hint instead of silently queuing. Honest wait times earned more trust than trying to hide the load.
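The backpressure check reduces to a pure function of queue depth. This is a sketch of the idea, not the production code — the thresholds and the drain-rate estimate are illustrative:

```python
def backpressure_hint(depth: int, max_depth: int, est_seconds_per_job: float):
    """Return None if the job can be queued, otherwise a rough
    'try again in N seconds' hint based on current queue depth."""
    if depth < max_depth:
        return None
    # Estimate how long until the queue drains back under the limit,
    # assuming one job completes per est_seconds_per_job on average.
    excess = depth - max_depth + 1
    return max(1, round(excess * est_seconds_per_job))
```

The API layer calls this with the stream's `XLEN` before enqueueing; a non-None result becomes the "try again in N seconds" response instead of a silent queue insert.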

Cold start is the enemy

The single biggest win for latency was treating cold starts as a bug, not a fact of life. Three things helped:

  1. Model pinning per worker. Each worker is told at boot which models to keep resident. I do not try to dynamically swap — swapping a 14 GB model in and out of VRAM is slower than any queueing delay I'd save.
  2. Warmup requests. Every worker, after boot, fires a synthetic job through each of its warm models. This pages the weights in and jit-compiles any kernels. By the time the worker announces itself to the dispatcher, it is hot.
  3. Graceful drain. When I deploy, I remove a worker from the registry first, let its inflight jobs finish, then restart. Users never see a 500.
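The warmup step is small enough to show in full. The pipeline call signature below is a hypothetical stand-in (the real inference API depends on the model runtime), but the shape is accurate: one throwaway job per resident model, then register:

```python
def warm_up(pipelines, warm_models):
    """Fire one tiny synthetic job through each resident model so the
    weights are paged into VRAM and any kernels are compiled before
    the worker announces itself to the dispatcher."""
    for name in warm_models:
        pipe = pipelines[name]
        # Minimal prompt at low resolution; the output is discarded.
        pipe(prompt="warmup", width=64, height=64, steps=1)
```

Only after `warm_up` returns does the worker send its first heartbeat, so the dispatcher never routes to a cold box.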

Load balancing across heterogeneous boxes

Not every GPU is equal. A 5090 on PCIe 5.0 x16 is meaningfully faster than the same card on a low-budget board with PCIe 4.0 x8. I measured real throughput per worker for each capability and stored a score field in the registry. The dispatcher uses it as a final tiebreaker: among two workers with equal load ratios, pick the higher score.

This mattered more than I expected. Heterogeneous hardware without scoring meant the slowest box became the tail-latency outlier.
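Concretely, the tiebreak slots into the same `min()` call from `pick_worker` as a lexicographic key — a sketch, assuming `score` is the measured-throughput field from the registry:

```python
def pick_best(candidates):
    """Primary: lowest load ratio. Tiebreak: highest measured score
    (negated so that min() prefers the faster box)."""
    return min(
        candidates,
        key=lambda w: (w["inflight"] / w["concurrency"], -w["score"]),
    )
```

Because tuples compare element by element, the score only matters when two workers have exactly equal load ratios, so it never overrides the load-balancing behavior.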

What I would do differently

  • Use Postgres advisory locks instead of Redis for coordination. Redis works, but I have had two brownouts in two years and Postgres has had zero.
  • Earlier observability. I went nine months before wiring up Prometheus. Do not be me.
  • Stop optimizing for per-worker concurrency. One-in-one-out with fast models beats two-in-two-out with a model that thrashes VRAM.

The numbers

At the time of writing, this cluster serves about 48,000 signed-up users, dispatches roughly 120-140k image jobs a day, and holds p50 image latency under 3 seconds and p95 under 7 seconds during peak. Video is slower by nature but runs on the same dispatcher with different capability tags.

All of this runs on electricity I pay for out of my living room, which is a strange sentence to type. But it is also why the free tier on zsky.ai can be genuinely free — 200 credits at signup and 100 per day, no card required.

If you want to build something similar, start with the dispatcher. Everything else is a detail around it.


I'm Cemhan Biricik. I'm a photographer with aphantasia who recovered from a TBI through photography and ended up building ZSky AI, a free-forever AI image and video platform. I write about infrastructure, AI tooling, and the artist-engineer overlap.
