Biricik Biricik

Posted on • Originally published at zsky.ai

Building a Free AI Image Generator: Architecture Decisions That Kept Us Alive

When we set out to build ZSky AI — a free AI image generator offering 50 daily generations without requiring signup — we knew the technical challenges would be significant. What we didn't anticipate was how many architecture decisions would come down to "what keeps us from going bankrupt."

This is the story of those decisions, the mistakes we made along the way, and what we'd do differently.

The Core Challenge

The fundamental tension in offering free AI image generation is simple: GPU compute is expensive, and free users don't pay. Every architecture decision we made was filtered through this lens.

Our constraints:

  • Generate images in under 10 seconds (users won't wait longer)
  • Support 50 free generations per user per day
  • Run sustainably without venture capital
  • Scale without proportional cost increases

Decision 1: Self-Hosted GPUs vs. Cloud APIs

This was the biggest decision we made, and it saved the project.

The cloud API approach would have been simpler to implement. Services like Replicate, RunPod, and various model-hosting providers offer pay-per-generation APIs. The math seemed reasonable at first: $0.01-0.05 per generation.

But when we modeled our target usage — thousands of free generations daily — the monthly cloud bill quickly exceeded $10,000. For a bootstrapped project with a generous free tier, that's unsustainable.

Our approach: self-hosted GPU cluster. We invested in our own hardware. The upfront cost was significant, but the per-generation cost dropped to a fraction of a cent. Here's the rough math:

```
Cloud API:    ~$0.03 per generation
Self-hosted:  ~$0.002 per generation (amortized hardware + electricity)
Monthly savings at 10,000 daily generations: ~$8,000
```

The breakeven point was about 3 months. After that, every generation was dramatically cheaper than cloud.
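The arithmetic behind that breakeven claim can be sketched in a few lines. The per-generation figures are the rough estimates above, and the $25,200 hardware cost is a hypothetical round number chosen only to illustrate a ~3-month breakeven:

```python
# Rough cost model using the per-generation estimates above.
CLOUD_COST_PER_GEN = 0.03    # $ per generation, cloud API estimate
SELF_COST_PER_GEN = 0.002    # $ per generation, amortized hardware + power
DAILY_GENERATIONS = 10_000

def monthly_savings(daily=DAILY_GENERATIONS, days=30):
    """Dollars saved per month by self-hosting instead of cloud APIs."""
    return daily * days * (CLOUD_COST_PER_GEN - SELF_COST_PER_GEN)

def breakeven_months(hardware_cost):
    """Months until the hardware investment pays for itself in savings."""
    return hardware_cost / monthly_savings()

print(round(monthly_savings()))            # roughly $8,400/month
print(round(breakeven_months(25_200), 1))  # about 3 months (hypothetical cost)
```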

Trade-offs:

  • We handle all hardware maintenance, driver updates, and failures
  • Scaling requires purchasing physical hardware (can't spin up instances on demand)
  • We need expertise in GPU systems administration
  • Power and cooling are ongoing concerns

What we'd do differently: We'd start with cloud APIs for the first month to validate demand, then migrate to self-hosted once we had traffic numbers to justify the investment.

Decision 2: Inference Pipeline Architecture

Our inference pipeline went through three major iterations.

Version 1: Synchronous Processing

The naive approach. User submits a prompt, the web server sends it to the GPU, waits for the result, and returns the image. Simple, but terrible under load.

Problems:

  • Web server threads blocked during generation (8-15 seconds each)
  • One slow generation blocks others
  • No graceful degradation under load

Version 2: Queue-Based Architecture

We moved to an asynchronous queue with Redis:

```
User Request → API Server → Redis Queue → GPU Worker(s) → Result Store → Polling/WebSocket
```

This separated the request handling from the generation. The API server adds jobs to the queue and returns immediately. GPU workers pull jobs and process them. The client polls or receives WebSocket updates.

Benefits:

  • API servers handle thousands of concurrent connections
  • GPU workers process jobs at their own pace
  • We can prioritize paid users in the queue
  • Failed generations can be retried automatically
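A minimal sketch of that producer/worker split. The function names are illustrative, and `queue.Queue` plus a plain dict stand in for Redis (in production the API server does an LPUSH, workers do a BRPOP, and results land in a real result store):

```python
import itertools
import json
import queue

job_queue = queue.Queue()   # stands in for the Redis list
results = {}                # stands in for the result store the client polls
_ids = itertools.count(1)

def submit_job(prompt):
    """API-server side: enqueue the job and return a job id immediately."""
    job_id = f"job-{next(_ids)}"
    job_queue.put(json.dumps({"id": job_id, "prompt": prompt}))
    return job_id

def worker_step(generate):
    """GPU-worker side: pull one job, run generation, store the result."""
    job = json.loads(job_queue.get())
    results[job["id"]] = generate(job["prompt"])
    return job["id"]
```

The same shape is what makes prioritization cheap: paid users can simply be pushed onto a separate high-priority list that workers check first.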

Version 3: Optimized Pipeline with Batching

The current iteration adds intelligent batching. Instead of processing one image at a time per GPU, we batch compatible requests:

```python
# Simplified batching logic
from collections import defaultdict

def batch_compatible(requests):
    """Group requests that can share a model load."""
    batches = defaultdict(list)
    for req in requests:
        key = (req.model, req.resolution, req.style_preset)
        batches[key].append(req)
    return batches
```

When multiple users request images with the same model and similar parameters, we batch them into a single forward pass. This improved throughput by 40-60% depending on the model.
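To make the grouping concrete, here is the batching function above exercised on a few toy requests. The `Req` namedtuple and the model/preset names are illustrative, not our real schema:

```python
from collections import defaultdict, namedtuple

Req = namedtuple("Req", "model resolution style_preset prompt")

def batch_compatible(requests):
    """Group requests that can share a model load and a forward pass."""
    batches = defaultdict(list)
    for req in requests:
        key = (req.model, req.resolution, req.style_preset)
        batches[key].append(req)
    return batches

reqs = [
    Req("sdxl", "1024x1024", "photo", "a red fox"),
    Req("sdxl", "1024x1024", "photo", "a blue jay"),
    Req("sdxl", "512x512", "anime", "a castle"),
]
batches = batch_compatible(reqs)
# Two batches: the two matching 1024x1024/photo requests share one forward pass.
```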

Decision 3: Anonymous Rate Limiting

Offering 50 free daily generations without requiring signup creates an interesting technical challenge: how do you rate-limit users you can't identify?

We use a multi-signal approach:

Layer 1: Nginx rate limiting

```nginx
# http {} context: 2 req/s zone keyed by client IP, 10 MB of state
limit_req_zone $binary_remote_addr zone=generate:10m rate=2r/s;
# location {} context: allow short bursts of 5 without queuing delay
limit_req zone=generate burst=5 nodelay;
```

This catches burst abuse at the proxy layer before it reaches the application.

Layer 2: Application-level tracking

We combine multiple signals into a user identity score:

  • IP address (primary signal, but unreliable for shared networks)
  • Browser fingerprint (canvas hash, screen resolution, timezone)
  • Signed cookie token (tracks daily count)

```python
def get_user_identity(request):
    signals = {
        'ip': request.remote_addr,
        'fingerprint': compute_fingerprint(request),
        'cookie_token': request.cookies.get('zsky_token'),
    }
    return weighted_identity_hash(signals)
```
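The `weighted_identity_hash` helper isn't shown above, but the idea can be sketched as a similarity score over per-signal hashes. The weights here are illustrative, not our production values:

```python
import hashlib

# Illustrative weights: an IP match alone is weak evidence on shared
# networks; fingerprint and cookie matches carry more weight.
WEIGHTS = {"ip": 0.3, "fingerprint": 0.4, "cookie_token": 0.3}

def signal_hashes(signals):
    """Hash each present signal so raw values never need to be stored."""
    return {k: hashlib.sha256(str(v).encode()).hexdigest()
            for k, v in signals.items() if v}

def identity_similarity(a, b):
    """Score in [0, 1]: weighted fraction of signals that match.
    A cleared cookie loses 0.3 of the score instead of minting a new user."""
    ha, hb = signal_hashes(a), signal_hashes(b)
    return sum(WEIGHTS[k] for k in ha if hb.get(k) == ha[k])
```

Two request identities above some cutoff are treated as the same daily-limit bucket.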

Layer 3: Behavioral analysis

Patterns that suggest abuse:

  • Rapid sequential requests (automated scripting)
  • Identical prompts repeated many times
  • Cookie clearing combined with same fingerprint
  • Multiple IPs from the same fingerprint in rapid succession

We don't block these users immediately — we serve them a gentle message explaining they may have hit their daily limit, with an option to create a free account for guaranteed tracking.
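The first pattern, rapid sequential requests, is the easiest to sketch: a sliding window of timestamps per identity. The window size and threshold here are illustrative:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS = 5  # illustrative threshold, not the production value

_recent = defaultdict(deque)  # identity hash -> recent request timestamps

def looks_automated(identity, now=None):
    """Flag identities firing faster than a human plausibly could."""
    now = now if now is not None else time.time()
    window = _recent[identity]
    window.append(now)
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```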

Results: Less than 0.5% of users attempt to game the system, and the GPU cost of occasional extra generations is lower than the engineering cost of perfect enforcement.

Decision 4: Model Serving Strategy

We run multiple diffusion models optimized for different use cases. The challenge is managing GPU memory across models.

Approach: Model hot-swapping with warm pools

Instead of keeping every model loaded on every GPU, we maintain a warm pool of frequently-used models and swap less popular ones on demand:

```
GPU 0-3: Primary model (always loaded, handles 70% of requests)
GPU 4-5: Secondary models (rotated based on demand)
GPU 6:   Video generation model (loaded on demand)
```

Model loading takes 10-30 seconds, so we predict demand based on recent request patterns and pre-load models before they're needed. A simple time-series analysis of the last hour's requests tells us which models to keep warm.
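That prediction really can be as simple as counting recent requests per model. A minimal sketch (the slot count and the function name are illustrative):

```python
from collections import Counter

WARM_SLOTS = 2  # GPUs available for secondary models (illustrative)

def models_to_keep_warm(recent_requests, primary="primary"):
    """Pick which secondary models to pre-load, based on the last
    hour's request mix. `recent_requests` is a list of model names."""
    counts = Counter(m for m in recent_requests if m != primary)
    return [model for model, _ in counts.most_common(WARM_SLOTS)]
```

Anything more elaborate than a frequency count never earned its complexity for us; the 10-30 second load time just needs to happen before the request arrives, not at an optimal moment.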

Decision 5: CDN and Caching Strategy

Generated images are served through Cloudflare, but the caching strategy is nuanced:

  • Generated images are cached by hash. If two users submit the same prompt with the same seed, the second request hits cache instead of the GPU.
  • Cache invalidation is time-based. Images expire after 24 hours to manage storage.
  • We never cache the generation request itself. Each request must pass through rate limiting.

In practice, cache hit rates are low (prompts are rarely identical), but during viral moments when many users try the same trending prompt, caching prevents GPU overload.
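The hash-based cache key can be sketched as follows (the helper name and parameter set are illustrative): identical generation inputs map to the same key, so only a byte-for-byte repeat of a request skips the GPU.

```python
import hashlib
import json

def cache_key(model, prompt, seed, **params):
    """Deterministic key over every input that affects the output image.
    sort_keys ensures the same params always serialize identically."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "seed": seed, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

Note that any parameter affecting output must be in the key; hashing only the prompt would serve one user's image for another user's different settings.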

Lessons Learned

  1. Start with the cost model, not the feature set. Every feature we considered was first evaluated against "what does this cost per user per day?"

  2. Imperfect rate limiting beats perfect authentication. A 95% effective anonymous rate limiter with zero friction outperforms a 100% effective system that requires signup.

  3. Batch everything possible. Whether it's inference requests, image processing, or database writes, batching is the single biggest performance optimization available.

  4. Measure real costs, not theoretical costs. Our actual per-generation cost differs from theoretical by about 30% due to failed generations, model loading overhead, and idle GPU time.

  5. Self-hosting is an operations burden. The cost savings are real, but don't underestimate the time spent on hardware maintenance. Budget for it.

Current State

ZSky AI serves thousands of generations daily across text-to-image and image-to-video. Our infrastructure costs are sustainable thanks to the decisions outlined above, and the free tier remains generous enough to provide real value.

If you want to try it: zsky.ai — 50 free generations per day, no signup required.

Happy to answer questions about any of these architectural decisions in the comments.
