I spent the last year building an AI video generator that runs on seven GPUs sitting in a room in my house. No AWS. No GCP. No "serverless" anything. Just power-hungry hardware, a lot of thermal paste, and an unreasonable stubbornness about not paying cloud bills.
Here's what I learned building ZSky AI from bare metal to 43,000+ users.
Why Self-Host Video Generation at All?
The short answer: cloud inference for video is obscenely expensive.
A single 1080p AI video generation takes 30-90 seconds of sustained GPU compute. On cloud providers, that's $0.50-$2.00 per generation depending on the model and resolution. When you're giving away 200 free credits to every user plus 100 daily, those numbers become existential fast.
I ran the math at 3,000 signups per day:
Cloud cost per user (conservative):
- 20 generations/day average
- $0.75/generation (mid-tier cloud GPU)
- = $15/user/day
- x 3,000 new users/day
- = $45,000/day in NEW user compute alone
Self-hosted cost per user:
- 7x RTX 5090 @ $2,000 each = $14,000 (one-time)
- Power: ~$400/month
- Amortized over 3 years: ~$800/month total
- = effectively $0.003/generation
That's not a rounding error. That's the difference between burning $1.35M/month and spending $800/month. Even at 10x my current scale, self-hosted wins.
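The back-of-envelope comparison above fits in a few lines. All the inputs are the estimates from this article, not quoted cloud prices:

```python
# Back-of-envelope cost comparison using the article's own estimates.
GENERATIONS_PER_USER_PER_DAY = 20
CLOUD_COST_PER_GENERATION = 0.75   # USD, mid-tier cloud GPU (estimate)
NEW_USERS_PER_DAY = 3_000

cloud_daily = GENERATIONS_PER_USER_PER_DAY * CLOUD_COST_PER_GENERATION * NEW_USERS_PER_DAY
cloud_monthly = cloud_daily * 30

HARDWARE_COST = 7 * 2_000          # 7x RTX 5090, one-time
POWER_PER_MONTH = 400              # USD
AMORTIZATION_MONTHS = 36           # 3 years

self_hosted_monthly = HARDWARE_COST / AMORTIZATION_MONTHS + POWER_PER_MONTH

print(f"Cloud:       ${cloud_daily:,.0f}/day (${cloud_monthly:,.0f}/month)")
print(f"Self-hosted: ${self_hosted_monthly:,.0f}/month")
```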
The Architecture (Without the Buzzwords)
The stack is deliberately boring:
[nginx reverse proxy]
|
[Flask API dispatcher] -- routes to GPU workers
|
[7x RTX 5090 workers] -- each runs inference independently
|
[R2 object storage] -- serves generated media via CDN
|
[Supabase] -- auth, credits, user state
No Kubernetes. No Docker orchestrator. No service mesh. Each GPU worker is a Python process that pulls jobs from a queue, runs inference, uploads the result, and reports back. If a worker crashes, it restarts. If it's slow, the dispatcher routes around it.
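Conceptually, each worker is just the loop below. This is a toy sketch, not the real process: `run_inference`, `upload_result`, and `report_done` stand in for the actual inference, R2 upload, and dispatcher callback, so here they're passed in as plain callables.

```python
import queue

def worker_loop(jobs, run_inference, upload_result, report_done):
    """One GPU worker: pull a job, run it, upload, report.
    Crash recovery (process restart) lives outside this loop."""
    while True:
        job = jobs.get()              # block until the dispatcher assigns a job
        if job is None:               # sentinel: shut this worker down
            return
        result = run_inference(job)   # the expensive GPU step
        url = upload_result(result)   # push the generated media to object storage
        report_done(job, url)         # tell the dispatcher this worker is free

# Toy run with stub callables standing in for the real inference/upload code
jobs = queue.Queue()
jobs.put({"prompt": "cat"})
jobs.put(None)
done = []
worker_loop(
    jobs,
    run_inference=lambda job: job["prompt"].upper(),
    upload_result=lambda result: f"https://cdn.example/{result}",
    report_done=lambda job, url: done.append(url),
)
```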
I tried the "proper" infrastructure approach first -- containerized workers, Kubernetes for orchestration, Prometheus for monitoring. It took three weeks to set up, broke constantly, and added 200ms of overhead to every request. I ripped it all out and replaced it with a 400-line Python dispatcher.
(Note: the dispatcher's comment about "free VRAM" below is a simplification -- load is the proxy for it.)
```python
# Simplified version of the dispatch logic
def assign_job(job):
    workers = get_available_workers()
    if not workers:
        return queue_job(job)
    # Pick the worker with the lowest current load (i.e. the most free capacity)
    best = min(workers, key=lambda w: w.current_load)
    # If even the best worker is >80% loaded, queue the job for later
    if best.current_load > 0.8:
        return queue_job(job)
    return best.submit(job)
```
The boring approach handles 3,000+ signups per day without incident. The Kubernetes approach couldn't survive a weekend.
Lesson 1: Thermal Management Is Your Actual Bottleneck
Seven 450W GPUs in a residential space produce roughly 3,150 watts of heat continuously. That's like running three space heaters 24/7 in one room.
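The arithmetic is worth doing before you buy hardware, because it's also how you size ventilation. One watt of electrical draw becomes roughly one watt of heat, and 1 W is about 3.412 BTU/hr (the heater wattage is an assumption for the comparison):

```python
# Heat output of the GPU rack, for sizing cooling/ventilation.
# Electrical draw ends up almost entirely as heat; 1 W ~= 3.412 BTU/hr.
GPUS = 7
WATTS_PER_GPU = 450

heat_watts = GPUS * WATTS_PER_GPU
heat_btu_per_hr = heat_watts * 3.412
space_heater_equiv = heat_watts / 1000   # assuming a ~1,000 W space heater

print(f"{heat_watts} W ~= {heat_btu_per_hr:,.0f} BTU/hr ~= {space_heater_equiv:.1f} space heaters")
```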
My first summer, the GPUs started thermal throttling at 2 PM every day. Generation times went from 45 seconds to 3+ minutes. Users thought the service was broken.
The fix wasn't a better cooling solution -- it was scheduling. Heavy batch jobs run overnight when ambient temperature drops. During peak heat hours, the dispatcher automatically shifts to lower-resolution fast-pass generations and queues heavy work for later.
```python
# Thermal-aware scheduling
def get_max_concurrent(gpu_temp, ambient_temp):
    if gpu_temp > 82:  # Celsius
        return 1       # Single job only, let it cool
    if ambient_temp > 28:
        return max(1, DEFAULT_CONCURRENT - 2)
    return DEFAULT_CONCURRENT
```
Not glamorous. But it works better than a $10,000 cooling system.
Lesson 2: The Queue Is the Product
Users don't care about your GPU count. They care about how long they wait. A 60-second generation that starts immediately feels faster than a 30-second generation that sits in a queue for 2 minutes.
I built the queue to show real-time position and estimated time. When a user submits a generation, they see:
Position: 3 of 12
Estimated start: ~45 seconds
Estimated completion: ~90 seconds after start
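The estimate itself can be a crude model and still kill the support tickets. A sketch, assuming every worker is busy and jobs take roughly the running average (the constants here are illustrative, not the live telemetry the real dispatcher uses):

```python
import math

# Illustrative constants -- the real values come from live worker telemetry
AVG_GENERATION_SECS = 55
WORKER_COUNT = 7

def queue_estimate(position, queue_length):
    """Crude model: with every worker busy, a job at 1-based `position`
    starts after ceil(position / WORKER_COUNT) average-length generations."""
    rounds = math.ceil(position / WORKER_COUNT)
    est_start = rounds * AVG_GENERATION_SECS
    return {
        "position": f"{position} of {queue_length}",
        "est_start_secs": est_start,
        "est_complete_secs": est_start + AVG_GENERATION_SECS,
    }
```

Even a model this crude beats a spinner, because it visibly counts down as jobs ahead of you complete.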
This reduced "stuck generation" support tickets by 80%. Users weren't actually stuck -- they just couldn't tell if anything was happening.
Lesson 3: VRAM Is Not RAM
This one cost me weeks. GPU memory (VRAM) doesn't behave like system RAM. You can't just malloc and free -- model weights stay resident, context windows accumulate, and fragmentation is brutal.
On a 32GB card, loading a video generation pipeline takes ~18GB. That leaves 14GB for the actual generation. Try to generate a long 1080p video and you'll OOM (out of memory) immediately because intermediate tensors need temporary space.
The solution: treat each GPU as having two "slots" -- one for the loaded model and one for the active generation. Never try to run two heavy generations simultaneously on one card. It's slower overall but eliminates the OOM crashes that would brick a card for 30+ seconds on recovery.
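The generation slot can be enforced with nothing more than a non-blocking lock per card. A minimal sketch (the class name and shape are mine, not ZSky's actual code):

```python
import threading

class GpuSlot:
    """At most one heavy generation per card: the model stays resident,
    and the active generation gets the remaining VRAM to itself."""
    def __init__(self):
        self._busy = threading.Lock()

    def try_generate(self, run):
        # Non-blocking acquire: if the card is mid-generation, return None
        # so the caller queues the job instead of risking an OOM on-card.
        if not self._busy.acquire(blocking=False):
            return None
        try:
            return run()          # the actual inference call
        finally:
            self._busy.release()
```

Rejecting up front and re-queuing is the point: a queued job costs seconds, while an OOM costs a 30-second recovery plus a failed generation the user already watched start.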
Lesson 4: Safety Cannot Be an Afterthought
This is the one I got wrong initially, and I'm including it because I think a lot of self-hosters are making the same mistake.
When you run your own inference, you're responsible for content safety. There's no API provider filtering prompts for you. I learned this the hard way when I discovered that my initial regex-based content filter was trivially bypassable.
Now there's a dedicated GPU running nothing but a safety classifier. Every prompt goes through it before touching any generation model. Every generated image goes through a vision-based content scanner before being served. It adds latency, and it's worth it.
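The pipeline shape matters more than the models: nothing reaches a generation GPU without passing stage one, and nothing reaches a user without passing stage two. A sketch with the classifiers passed in as plain callables (stand-ins for the real prompt classifier and vision scanner):

```python
# Two-stage safety gate. `classify_prompt` and `scan_output` stand in
# for the real classifier models; each returns True when content is safe.
def safe_generate(prompt, classify_prompt, generate, scan_output):
    if not classify_prompt(prompt):          # stage 1: unsafe prompts never hit a GPU
        return {"status": "rejected", "stage": "prompt"}
    media = generate(prompt)                 # only vetted prompts reach generation
    if not scan_output(media):               # stage 2: scan what actually came out
        return {"status": "rejected", "stage": "output"}
    return {"status": "ok", "media": media}
```

Stage two exists because stage one is not enough: a safe-looking prompt can still produce unsafe output, so the generated media itself is the last thing checked before serving.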
If you're self-hosting generative AI and your safety layer is just regex patterns, you have no safety layer.
Lesson 5: The Business Model Has to Match the Architecture
Self-hosting means your costs are fixed. You pay for hardware and power whether you have 100 users or 100,000. This inverts the typical SaaS economics -- you need high utilization, not high margins.
That's why ZSky's free tier is generous (200 credits + 100 daily). Empty GPUs don't save money. Full GPUs don't cost more. The optimal strategy is maximum utilization with premium tiers for priority queue access and higher resolutions.
What I'd Do Differently
Start with fewer GPUs. I bought all seven upfront. I should have started with three, validated the product, then scaled. Two sat idle for months.
Build monitoring first. I added real monitoring six months in. Those six months were full of mystery crashes I couldn't diagnose because I had no data.
Don't build the queue system yourself. I wrote a custom job queue because "how hard can it be?" It took four rewrites before it was stable. Use Redis Queue or Celery. Seriously.
The Numbers Today
- 43,715 registered users
- 3,000+ new signups per day
- 1080p video with audio generation
- Average generation time: 45-60 seconds
- Monthly infrastructure cost: ~$800
- Equivalent cloud cost at current usage: ~$180,000/month
Self-hosting isn't right for every AI product. But if your product is compute-heavy, your users expect fast results, and you have the tolerance for hardware maintenance -- it's worth considering before you sign that cloud contract.
ZSky AI is a free AI image and video generator at zsky.ai. No signup required to try it.