I spent the last year building an AI video generator that runs on seven GPUs sitting in a room in my house. No AWS. No GCP. No "serverless" anything. Just power-hungry hardware, a lot of thermal paste, and an unreasonable stubbornness about not paying cloud bills.
Here's what I learned building ZSky AI from bare metal to 43,000+ users.
Why Self-Host Video Generation at All?
The short answer: cloud inference for video is obscenely expensive.
A single 1080p AI video generation takes 30-90 seconds of sustained GPU compute. On cloud providers, that's $0.50-$2.00 per generation depending on the model and resolution. When you're giving away 200 free credits to every user plus 100 daily, those numbers become existential fast.
I ran the math at 3,000 signups per day:
Cloud cost per user (conservative):
- 20 generations/day average
- $0.75/generation (mid-tier cloud GPU)
- = $15/user/day
- x 3,000 new users/day
- = $45,000/day in NEW user compute alone
Self-hosted cost per user:
- 7x RTX 5090 @ $2,000 each = $14,000 (one-time)
- Power: ~$400/month
- Amortized over 3 years: ~$800/month total
- = effectively $0.003/generation
That's not a rounding error. That's the difference between burning $1.35M/month and spending $800/month. Even at 10x my current scale, self-hosted wins.
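The back-of-envelope comparison above fits in a few lines. All the inputs are the estimates from this article, not quoted cloud prices:

```python
# Back-of-envelope cost comparison using the article's own estimates.
GENERATIONS_PER_USER_PER_DAY = 20
CLOUD_COST_PER_GENERATION = 0.75   # USD, mid-tier cloud GPU (estimate)
NEW_USERS_PER_DAY = 3_000

cloud_daily = GENERATIONS_PER_USER_PER_DAY * CLOUD_COST_PER_GENERATION * NEW_USERS_PER_DAY
cloud_monthly = cloud_daily * 30

HARDWARE_COST = 7 * 2_000          # 7x RTX 5090, one-time
POWER_PER_MONTH = 400              # USD
AMORTIZATION_MONTHS = 36           # 3 years

self_hosted_monthly = HARDWARE_COST / AMORTIZATION_MONTHS + POWER_PER_MONTH

print(f"Cloud:       ${cloud_daily:,.0f}/day (${cloud_monthly:,.0f}/month)")
print(f"Self-hosted: ${self_hosted_monthly:,.0f}/month")
```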
The Architecture (Without the Buzzwords)
The stack is deliberately boring:
[nginx reverse proxy]
|
[Flask API dispatcher] -- routes to GPU workers
|
[7x RTX 5090 workers] -- each runs inference independently
|
[R2 object storage] -- serves generated media via CDN
|
[Supabase] -- auth, credits, user state
No Kubernetes. No Docker orchestrator. No service mesh. Each GPU worker is a Python process that pulls jobs from a queue, runs inference, uploads the result, and reports back. If a worker crashes, it restarts. If it's slow, the dispatcher routes around it.
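Conceptually, each worker is just the loop below. This is a toy sketch, not the real process: `run_inference`, `upload_result`, and `report_done` stand in for the actual inference, R2 upload, and dispatcher callback, so here they're passed in as plain callables.

```python
import queue

def worker_loop(jobs, run_inference, upload_result, report_done):
    """One GPU worker: pull a job, run it, upload, report.
    Crash recovery (process restart) lives outside this loop."""
    while True:
        job = jobs.get()              # block until the dispatcher assigns a job
        if job is None:               # sentinel: shut this worker down
            return
        result = run_inference(job)   # the expensive GPU step
        url = upload_result(result)   # push the generated media to object storage
        report_done(job, url)         # tell the dispatcher this worker is free

# Toy run with stub callables standing in for the real inference/upload code
jobs = queue.Queue()
jobs.put({"prompt": "cat"})
jobs.put(None)
done = []
worker_loop(
    jobs,
    run_inference=lambda job: job["prompt"].upper(),
    upload_result=lambda result: f"https://cdn.example/{result}",
    report_done=lambda job, url: done.append(url),
)
```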
I tried the "proper" infrastructure approach first -- containerized workers, Kubernetes for orchestration, Prometheus for monitoring. It took three weeks to set up, broke constantly, and added 200ms of overhead to every request. I ripped it all out and replaced it with a 400-line Python dispatcher.
(Note: the dispatcher's comment about "free VRAM" below is a simplification -- load is the proxy for it.)
```python
# Simplified version of the dispatch logic
def assign_job(job):
    workers = get_available_workers()
    if not workers:
        return queue_job(job)
    # Pick the worker with the lowest current load (i.e. the most free capacity)
    best = min(workers, key=lambda w: w.current_load)
    # If even the best worker is >80% loaded, queue the job for later
    if best.current_load > 0.8:
        return queue_job(job)
    return best.submit(job)
```
The boring approach handles 3,000+ signups per day without incident. The Kubernetes approach couldn't survive a weekend.
Lesson 1: Thermal Management Is Your Actual Bottleneck
Seven 450W GPUs in a residential space produce roughly 3,150 watts of heat continuously. That's like running three space heaters 24/7 in one room.
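The arithmetic is worth doing before you buy hardware, because it's also how you size ventilation. One watt of electrical draw becomes roughly one watt of heat, and 1 W is about 3.412 BTU/hr (the heater wattage is an assumption for the comparison):

```python
# Heat output of the GPU rack, for sizing cooling/ventilation.
# Electrical draw ends up almost entirely as heat; 1 W ~= 3.412 BTU/hr.
GPUS = 7
WATTS_PER_GPU = 450

heat_watts = GPUS * WATTS_PER_GPU
heat_btu_per_hr = heat_watts * 3.412
space_heater_equiv = heat_watts / 1000   # assuming a ~1,000 W space heater

print(f"{heat_watts} W ~= {heat_btu_per_hr:,.0f} BTU/hr ~= {space_heater_equiv:.1f} space heaters")
```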
My first summer, the GPUs started thermal throttling at 2 PM every day. Generation times went from 45 seconds to 3+ minutes. Users thought the service was broken.
The fix wasn't a better cooling solution -- it was scheduling. Heavy batch jobs run overnight when ambient temperature drops. During peak heat hours, the dispatcher automatically shifts to lower-resolution fast-pass generations and queues heavy work for later.
```python
# Thermal-aware scheduling
def get_max_concurrent(gpu_temp, ambient_temp):
    if gpu_temp > 82:  # Celsius
        return 1       # Single job only, let it cool
    if ambient_temp > 28:
        return max(1, DEFAULT_CONCURRENT - 2)
    return DEFAULT_CONCURRENT
```
Not glamorous. But it works better than a $10,000 cooling system.
Lesson 2: The Queue Is the Product
Users don't care about your GPU count. They care about how long they wait. A 60-second generation that starts immediately feels faster than a 30-second generation that sits in a queue for 2 minutes.
I built the queue to show real-time position and estimated time. When a user submits a generation, they see:
Position: 3 of 12
Estimated start: ~45 seconds
Estimated completion: ~90 seconds after start
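The estimate itself can be a crude model and still kill the support tickets. A sketch, assuming every worker is busy and jobs take roughly the running average (the constants here are illustrative, not the live telemetry the real dispatcher uses):

```python
import math

# Illustrative constants -- the real values come from live worker telemetry
AVG_GENERATION_SECS = 55
WORKER_COUNT = 7

def queue_estimate(position, queue_length):
    """Crude model: with every worker busy, a job at 1-based `position`
    starts after ceil(position / WORKER_COUNT) average-length generations."""
    rounds = math.ceil(position / WORKER_COUNT)
    est_start = rounds * AVG_GENERATION_SECS
    return {
        "position": f"{position} of {queue_length}",
        "est_start_secs": est_start,
        "est_complete_secs": est_start + AVG_GENERATION_SECS,
    }
```

Even a model this crude beats a spinner, because it visibly counts down as jobs ahead of you complete.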
This reduced "stuck generation" support tickets by 80%. Users weren't actually stuck -- they just couldn't tell if anything was happening.
Lesson 3: VRAM Is Not RAM
This one cost me weeks. GPU memory (VRAM) doesn't behave like system RAM. You can't just malloc and free -- model weights stay resident, context windows accumulate, and fragmentation is brutal.
On a 32GB card, loading a video generation pipeline takes ~18GB. That leaves 14GB for the actual generation. Try to generate a long 1080p video and you'll OOM (out of memory) immediately because intermediate tensors need temporary space.
The solution: treat each GPU as having two "slots" -- one for the loaded model and one for the active generation. Never try to run two heavy generations simultaneously on one card. It's slower overall but eliminates the OOM crashes that would brick a card for 30+ seconds on recovery.
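The generation slot can be enforced with nothing more than a non-blocking lock per card. A minimal sketch (the class name and shape are mine, not ZSky's actual code):

```python
import threading

class GpuSlot:
    """At most one heavy generation per card: the model stays resident,
    and the active generation gets the remaining VRAM to itself."""
    def __init__(self):
        self._busy = threading.Lock()

    def try_generate(self, run):
        # Non-blocking acquire: if the card is mid-generation, return None
        # so the caller queues the job instead of risking an OOM on-card.
        if not self._busy.acquire(blocking=False):
            return None
        try:
            return run()          # the actual inference call
        finally:
            self._busy.release()
```

Rejecting up front and re-queuing is the point: a queued job costs seconds, while an OOM costs a 30-second recovery plus a failed generation the user already watched start.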
Lesson 4: Safety Cannot Be an Afterthought
This is the one I got wrong initially, and I'm including it because I think a lot of self-hosters are making the same mistake.
When you run your own inference, you're responsible for content safety. There's no API provider filtering prompts for you. I learned this the hard way when I discovered that my initial regex-based content filter was trivially bypassable.
Now there's a dedicated GPU running nothing but a safety classifier. Every prompt goes through it before touching any generation model. Every generated image goes through a vision-based content scanner before being served. It adds latency, and it's worth it.
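The pipeline shape matters more than the models: nothing reaches a generation GPU without passing stage one, and nothing reaches a user without passing stage two. A sketch with the classifiers passed in as plain callables (stand-ins for the real prompt classifier and vision scanner):

```python
# Two-stage safety gate. `classify_prompt` and `scan_output` stand in
# for the real classifier models; each returns True when content is safe.
def safe_generate(prompt, classify_prompt, generate, scan_output):
    if not classify_prompt(prompt):          # stage 1: unsafe prompts never hit a GPU
        return {"status": "rejected", "stage": "prompt"}
    media = generate(prompt)                 # only vetted prompts reach generation
    if not scan_output(media):               # stage 2: scan what actually came out
        return {"status": "rejected", "stage": "output"}
    return {"status": "ok", "media": media}
```

Stage two exists because stage one is not enough: a safe-looking prompt can still produce unsafe output, so the generated media itself is the last thing checked before serving.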
If you're self-hosting generative AI and your safety layer is just regex patterns, you have no safety layer.
Lesson 5: The Business Model Has to Match the Architecture
Self-hosting means your costs are fixed. You pay for hardware and power whether you have 100 users or 100,000. This inverts the typical SaaS economics -- you need high utilization, not high margins.
That's why ZSky's free tier is generous (200 credits + 100 daily). Empty GPUs don't save money. Full GPUs don't cost more. The optimal strategy is maximum utilization with premium tiers for priority queue access and higher resolutions.
What I'd Do Differently
Start with fewer GPUs. I bought all seven upfront. I should have started with three, validated the product, then scaled. Two sat idle for months.
Build monitoring first. I added real monitoring six months in. Those six months were full of mystery crashes I couldn't diagnose because I had no data.
Don't build the queue system yourself. I wrote a custom job queue because "how hard can it be?" It took four rewrites before it was stable. Use Redis Queue or Celery. Seriously.
The Numbers Today
- 43,715 registered users
- 3,000+ new signups per day
- 1080p video with audio generation
- Average generation time: 45-60 seconds
- Monthly infrastructure cost: ~$800
- Equivalent cloud cost at current usage: ~$180,000/month
Self-hosting isn't right for every AI product. But if your product is compute-heavy, your users expect fast results, and you have the tolerance for hardware maintenance -- it's worth considering before you sign that cloud contract.
ZSky AI is a free AI image and video generator at zsky.ai. No signup required to try it.