DEV Community

Biricik Biricik

Why I Self-Host 7 RTX 5090 GPUs Instead of Using Cloud AI

The Short Version

I run seven NVIDIA RTX 5090 GPUs in my home. That's 224 GB of VRAM sitting in a single tower with a 32-core, 64-thread CPU. People on Reddit tell me I'm insane. Cloud providers tell me I'm leaving money on the table. My electricity bill tells me… well, let's not talk about that.

But every morning, when ZSky AI serves thousands of users their first image in under two seconds — with zero cold-start latency, zero API rate limits, and zero permission from anyone else — I know I made the right call.

My name is Cemhan Biricik. I'm a photographer, a two-time National Geographic award winner, an immigrant from Istanbul, and the founder of ZSky AI. This is the story of why I chose to own my AI infrastructure instead of renting it.


Who Am I and Why Do I Care About GPUs?

I've been building computers since the early 2000s. Back then, I ran a company called ICEe PC — custom-built gaming and workstation rigs, back when water cooling was exotic and SLI was the bleeding edge. Hardware has always been my language.

Then life took a turn. I became a professional photographer, shooting campaigns for Versace Mansion, the Waldorf Astoria, St. Regis, and the Miami Dolphins. I won two National Geographic awards. I built Biricik Media, a content studio that generated over 50 million viral views.

But underneath all of that, I have a condition called aphantasia — I literally cannot form mental images. My mind's eye is black. And after a traumatic brain injury that temporarily took my speech, I became obsessed with the idea that technology could bridge the gap between imagination and creation.

That obsession became ZSky AI: an AI creative platform where anyone can generate images, videos, and audio — no design degree required, no subscription wall for basic use.


The Cloud Trap

When I started building ZSky, the obvious path was cloud GPU. Spin up some A100s on AWS or GCP, pay per hour, scale as needed. Every YC blog post says the same thing: don't build infrastructure, build product.

So I did the math.

Cloud GPU Costs for Our Workload

| Resource | Cloud (per month) | Self-hosted (amortized) |
| --- | --- | --- |
| 7x high-end GPUs (A100/H100 equivalent) | $15,000–$25,000 | ~$2,500 (power + amortized HW) |
| Inference latency | 200–500 ms cold start | <50 ms warm |
| Storage (model weights, outputs) | $500–$1,500 | Included (local NVMe) |
| Bandwidth (serving video) | $1,000–$3,000 | Included (Cloudflare tunnel) |
| **Total** | **$17,000–$30,000/mo** | **~$2,500/mo** |

That's not a rounding error; it's a 6–12x cost difference. And it gets worse as you scale: cloud GPU pricing works against economies of scale for inference workloads. The more users you serve, the more you pay per user.

Self-hosting flips that. Once the hardware is paid off, marginal cost per user approaches the cost of electricity.
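The break-even claim is easy to sanity-check with rough numbers. A minimal sketch, where the total rig cost and the cloud rate are illustrative assumptions (not ZSky's actual books), using the low end of the cloud estimate from the table above:

```python
# Rough break-even estimate: self-hosted GPUs vs. cloud rental.
# All dollar figures are illustrative assumptions, not actual costs.

HARDWARE_COST = 50_000   # assumed total rig cost (GPUs, CPU, RAM, cooling)
MONTHLY_CLOUD = 17_000   # low end of the cloud estimate above
MONTHLY_SELF = 2_500     # power + amortized hardware

monthly_savings = MONTHLY_CLOUD - MONTHLY_SELF
payback_months = HARDWARE_COST / monthly_savings

print(f"Monthly savings: ${monthly_savings:,}")
print(f"Payback period: {payback_months:.1f} months")
```

With those assumptions the rig pays for itself in roughly three and a half months, which lines up with the 3–4 month figure later in this post.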

The Real Killer: Cold Starts and Queuing

Cloud GPU instances take 30–120 seconds to spin up. If you keep them warm, you're paying for idle time. If you don't, your users stare at a loading spinner.

With local GPUs, models stay loaded in VRAM. An image generation request hits a warm model and returns in under 2 seconds. Video with audio? 30 seconds for 1080p. No queue, no cold start, no prayer to the AWS spot instance gods.
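The "models stay loaded" idea boils down to a warm registry: pay the load cost once, then every later request hits the resident copy. Here's a minimal sketch, not the actual ZSky stack; `load_weights` is a stand-in for whatever framework call moves a model from disk into VRAM (e.g. a diffusers pipeline followed by `.to("cuda")`):

```python
# Minimal warm-model registry: models are loaded once and kept
# resident, so every request after the first skips the cold start.
import time

def load_weights(name):
    """Placeholder for an expensive model load (disk -> VRAM)."""
    time.sleep(0.1)  # simulate load time
    return {"name": name, "ready": True}

class WarmRegistry:
    def __init__(self):
        self._models = {}

    def get(self, name):
        # First request pays the load cost; later requests are warm.
        if name not in self._models:
            self._models[name] = load_weights(name)
        return self._models[name]

registry = WarmRegistry()

t0 = time.perf_counter()
registry.get("image-diffusion-v1")   # cold: pays the load
cold = time.perf_counter() - t0

t0 = time.perf_counter()
registry.get("image-diffusion-v1")   # warm: dictionary lookup
warm = time.perf_counter() - t0

print(f"cold: {cold*1000:.0f} ms, warm: {warm*1000:.3f} ms")
```

The cloud version of this is "keep an instance running 24/7 and pay for idle"; locally, the idle cost is just VRAM you already own.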


The Build

Here's what the primary workstation looks like:

  • GPUs: 7x NVIDIA RTX 5090 (32 GB VRAM each = 224 GB total)
  • CPU: 32 cores / 64 threads
  • RAM: High capacity DDR5
  • Storage: Multi-TB NVMe array
  • Network: Gigabit LAN with Tailscale overlay for remote management
  • Cooling: Custom loop + aggressive fan curves (this thing heats my office in winter)

Beyond the primary node, I run a small cluster of additional machines — a couple of RTX 4090 workstations for overflow and testing — all connected via SSH and managed through a unified config.
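A "unified config" for a small cluster can be as simple as one host map plus a helper that builds the SSH command for any remote action. This is a sketch under assumptions — the host names, addresses, and the health-check command are made up for illustration:

```python
# One source of truth for the cluster, and a helper that turns
# (host, command) into an ssh invocation. Addresses are illustrative.
HOSTS = {
    "primary":   {"addr": "10.0.0.10", "gpus": 7},
    "overflow1": {"addr": "10.0.0.11", "gpus": 1},
    "overflow2": {"addr": "10.0.0.12", "gpus": 1},
}

def ssh_command(host, remote_cmd):
    """Build the argv for running `remote_cmd` on a cluster node."""
    addr = HOSTS[host]["addr"]
    return ["ssh", f"admin@{addr}", remote_cmd]

# e.g. check GPU temps everywhere (would run via subprocess.run):
for name in HOSTS:
    print(ssh_command(name, "nvidia-smi --query-gpu=temperature.gpu --format=csv"))
```

With Tailscale providing the overlay network, the addresses stay stable no matter where you manage the cluster from.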

The entire inference stack runs locally: model loading, request routing, video encoding (with hardware acceleration across all 32 threads), and delivery through Cloudflare tunnels. No Lambda. No SageMaker. No managed anything.


What Self-Hosting Actually Requires

I won't pretend this is easy. Here's what you sign up for:

1. You Are the SRE

When a GPU throws an ECC error at 3 AM, there's no support ticket. You're reflashing firmware in your pajamas. I've had to debug CUDA driver mismatches, thermal throttling under sustained load, and PCIe lane allocation issues that only manifest under 7-GPU configurations.

2. Power and Cooling Are Real Engineering

Seven 5090s under full load draw serious wattage. I had to upgrade my electrical panel and run a dedicated 30A circuit. Cooling is a constant battle — ambient temps in South Florida don't help.
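A quick power budget shows why the panel upgrade was unavoidable. The 575 W figure below is the RTX 5090's rated board power; the system overhead number is an assumption for CPU, RAM, storage, fans, and PSU losses:

```python
# Back-of-envelope power budget for sizing the dedicated circuit.
GPU_TDP_W = 575           # RTX 5090 rated board power
NUM_GPUS = 7
SYSTEM_OVERHEAD_W = 800   # assumed CPU, RAM, storage, fans, PSU losses

total_w = GPU_TDP_W * NUM_GPUS + SYSTEM_OVERHEAD_W
amps_240v = total_w / 240

print(f"Worst-case draw: {total_w} W (~{amps_240v:.1f} A at 240 V)")
```

Roughly 4.8 kW worst case, or about 20 A at 240 V. A 30 A circuit derated to 80% for continuous load gives 24 A of usable headroom, so a dedicated circuit is the only sane option.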

3. You Need to Be a Full-Stack Engineer

I write the inference code, the queue management, the model swapping logic, the video encoding pipelines (always with `-threads 32`), the monitoring, the alerting. There's no managed service abstracting this away.
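To make the encoding step concrete, here's a sketch of building an ffmpeg invocation that pins all 32 threads, as mentioned above. The filenames and codec settings are illustrative; the actual pipeline details aren't public:

```python
# Build an ffmpeg argv that uses every CPU thread for encoding.
# Codec choices and filenames are illustrative assumptions.
def encode_cmd(src, dst, threads=32):
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-threads", str(threads),   # use all 32 threads
        "-c:v", "libx264",
        "-preset", "fast",
        "-c:a", "aac",
        dst,
    ]

# Would be executed with subprocess.run(encode_cmd(...), check=True)
print(" ".join(encode_cmd("raw_clip.mp4", "output_1080p.mp4")))
```

Keeping the command construction in one function makes it trivial to swap presets or add hardware-accelerated encoders later without touching the rest of the pipeline.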

4. Redundancy Is Your Problem

Cloud providers give you multi-AZ redundancy by default. I give myself redundancy by having spare GPUs and a failover node. It's not the same, and I've accepted that tradeoff.


Why It's Worth It Anyway

Total Control Over the Stack

I can swap models in minutes. I can test a new diffusion architecture on real traffic with a config change. I don't need to rebuild a Docker container, push it to ECR, update a SageMaker endpoint, and wait 15 minutes. I just… load the model.

Privacy by Default

User images never leave my hardware. There's no S3 bucket to misconfigure, no third-party API logging prompts, no compliance nightmare. The data stays on my NVMe drives, encrypted at rest.

Speed as a Feature

Our users notice the speed. When you're used to cloud AI tools that make you wait 15–30 seconds for an image, getting it in 2 seconds feels like magic. That speed is only possible because the models are always warm, always local.

Long-Term Economics

The hardware pays for itself in 3–4 months compared to equivalent cloud spend. After that, it's almost free inference. For a bootstrapped startup with no VC money, that's the difference between survival and running out of runway.


The Philosophy: Own Your Stack, Control Your Destiny

This goes beyond cost optimization. It's a philosophical position.

When you build on someone else's infrastructure, you're one pricing change away from your business model breaking. AWS can raise prices. NVIDIA can restrict cloud GPU allocations. API providers can change their terms of service overnight.

When you own your hardware, your cost structure is fixed. Your capabilities are known. Your dependencies are minimal. You can make decisions based on what's best for your users, not what's cheapest on your cloud bill.

I learned this lesson the hard way across multiple businesses. With ZSky AI, I decided from day one: if it's core to the product, I own it.


Should You Self-Host?

Honestly? Probably not. If you're a startup doing fewer than 1,000 inference calls per day, cloud is fine. The operational overhead of self-hosting isn't worth it at small scale.

But if you're:

  • Serving thousands of daily active users
  • Running inference as your core product (not a feature)
  • Sensitive to latency
  • Bootstrapped and watching every dollar
  • Experienced with hardware and Linux systems administration

…then self-hosting deserves a serious look. The math works, the performance is better, and the independence is liberating.



If you have questions about self-hosting GPU infrastructure, drop them in the comments. I've made every mistake possible so you don't have to.


Cemhan Biricik is a photographer, AI engineer, and founder of ZSky AI. He previously founded ICEe PC, Biricik Media, and Fast Lab Technologies. He lives in South Florida with his family and an unreasonable number of GPUs.
