
Gerus Lab

Posted on • Originally published at gerus-lab.com

Your AI Image Pipeline Will Break in Production — Here's How We Fixed Ours

Everyone's building AI-powered apps in 2026. Few talk about what happens when your "call OpenAI and return the result" approach meets real users.

We built an AI interior design SaaS that generates room redesigns using OpenAI's image APIs. In development, everything worked. In production with real users hitting the generate button, everything broke.

Here's what went wrong and the architecture we built to fix it.

The Problem: AI APIs Are Not REST APIs

When you build a typical CRUD app, your API calls take 50-200ms. You call a database, get a response, return it. Simple.

AI image generation is different:

  • Latency: 15-60 seconds per request
  • Rate limits: OpenAI enforces strict RPM and TPM limits
  • Failures: Network timeouts, 429s, 500s are routine, not exceptional
  • Cost: Each failed request that gets retried costs real money
  • Concurrency: 50 users hitting "Generate" simultaneously will destroy your throughput

We learned this the hard way. Our first production deployment handled exactly 3 concurrent users before falling over.

The Naive Approach (What We Started With)

// Don't do this in production
@Post('generate')
async generateDesign(@Body() dto: GenerateDto) {
  const result = await this.openai.images.generate({
    model: 'dall-e-3',
    prompt: dto.prompt,
    size: '1024x1024',
  });

  return { imageUrl: result.data[0].url };
}

This code has at least five production-killing problems:

  1. Request timeout: Most load balancers kill connections after 30s. Your 45s image generation dies silently.
  2. No retry logic: A transient 429 means a lost generation for the user.
  3. Memory pressure: Each pending request holds a connection and memory. 50 concurrent = OOM.
  4. No rate limit awareness: You'll burn through your OpenAI quota in minutes.
  5. No observability: When it breaks, you have no idea why.

The Architecture That Actually Works

After iterating through three major rewrites, here's what we landed on:

User Request → API → BullMQ Queue → Worker Pool → OpenAI API
                ↓                        ↓
           Job ID returned          Result → Redis → Webhook/SSE

The key insight: decouple the request from the execution. The user gets an immediate response (a job ID), and the actual AI work happens asynchronously in a controlled worker pool.
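The decoupling can be sketched with a toy in-memory queue. This is illustrative only — names and shapes are my own, and the production version is BullMQ backed by Redis — but it shows the essential contract: enqueue returns an ID immediately, and a worker drains the queue on its own schedule:

```typescript
// Toy sketch of request/execution decoupling — illustrative only;
// the real setup uses BullMQ backed by Redis.
type JobState = 'queued' | 'active' | 'completed';

interface ToyJob<T, R> {
  id: string;
  payload: T;
  state: JobState;
  result?: R;
}

class ToyQueue<T, R> {
  private jobs = new Map<string, ToyJob<T, R>>();
  private waiting: string[] = [];
  private seq = 0;

  // The API handler calls this and returns immediately with a job ID
  enqueue(payload: T): string {
    const id = `gen-${++this.seq}`;
    this.jobs.set(id, { id, payload, state: 'queued' });
    this.waiting.push(id);
    return id;
  }

  // A worker pool drains the queue asynchronously, independent of the request
  async drain(handler: (payload: T) => Promise<R>): Promise<void> {
    while (this.waiting.length > 0) {
      const id = this.waiting.shift()!;
      const job = this.jobs.get(id)!;
      job.state = 'active';
      job.result = await handler(job.payload);
      job.state = 'completed';
    }
  }

  status(id: string): ToyJob<T, R> | undefined {
    return this.jobs.get(id);
  }
}

const q = new ToyQueue<string, string>();
const id = q.enqueue('scandinavian living room');
console.log(q.status(id)?.state); // 'queued' — returned before any work ran
q.drain(async (p) => `image-for:${p}`).then(() => {
  console.log(q.status(id)?.state); // 'completed'
});
```

The caller never waits on the slow work; it only ever polls (or subscribes to) the job's state.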

Step 1: The Queue Layer

// generation.module.ts
import { BullModule } from '@nestjs/bullmq';

@Module({
  imports: [
    BullModule.registerQueue({
      name: 'image-generation',
      defaultJobOptions: {
        attempts: 3,
        backoff: {
          type: 'exponential',
          delay: 5000,
        },
        removeOnComplete: { age: 3600 },
        removeOnFail: { age: 86400 },
      },
    }),
  ],
})
export class GenerationModule {}

The defaultJobOptions here are critical:

  • 3 attempts with exponential backoff: with a 5s base delay, BullMQ retries ~5s after the first failure and ~10s after the second (the delay doubles each attempt). By the second retry, the rate limit window has usually reset.
  • removeOnComplete after 1 hour: Don't let Redis fill up with completed jobs.
  • removeOnFail after 24 hours: Keep failed jobs around long enough to debug.
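BullMQ's exponential strategy computes the delay as base * 2^(retry - 1). A quick helper makes the schedule explicit (illustrative — this mirrors BullMQ's built-in behavior rather than replacing it):

```typescript
// Delay before retry N (1-based) under BullMQ's exponential backoff:
// delay = base * 2^(N - 1)
function backoffDelayMs(baseDelayMs: number, retryNumber: number): number {
  return baseDelayMs * 2 ** (retryNumber - 1);
}

// With attempts: 3 and delay: 5000, a job gets two retries:
// ~5s after the first failure, then ~10s after the second.
const schedule = [1, 2].map((n) => backoffDelayMs(5000, n));
console.log(schedule); // [5000, 10000]
```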

Step 2: The Controller (Fast Response)

@Post('generate')
async generateDesign(@Body() dto: GenerateDto, @Req() req) {
  const job = await this.generationQueue.add('generate-image', {
    userId: req.user.id,
    prompt: dto.prompt,
    style: dto.style,
    roomType: dto.roomType,
    createdAt: Date.now(),
  }, {
    priority: this.getUserPriority(req.user),
    jobId: `gen-${req.user.id}-${Date.now()}`,
  });

  return {
    jobId: job.id,
    status: 'queued',
    estimatedWaitSeconds: await this.estimateWait(),
  };
}

private async estimateWait(): Promise<number> {
  const waiting = await this.generationQueue.getWaitingCount();
  const active = await this.generationQueue.getActiveCount();
  // Each job takes ~30s on average and we run 3 workers, so count
  // both queued jobs and those already in flight
  return Math.ceil(((waiting + active) / 3) * 30);
}

The user gets a response in <100ms. They get a job ID and an estimated wait time. The frontend can now show a progress indicator instead of a spinning wheel that might die.

Step 3: The Worker (Where the Magic Happens)

@Processor('image-generation', {
  concurrency: 3,
  limiter: {
    max: 10,
    duration: 60000, // 10 jobs per minute max
  },
})
export class GenerationWorker extends WorkerHost {
  async process(job: Job<GenerationPayload>): Promise<GenerationResult> {
    const { userId, prompt, style, roomType } = job.data;

    await job.updateProgress(10);

    // Build the prompt with guardrails
    const engineeredPrompt = this.buildPrompt(prompt, style, roomType);

    await job.updateProgress(20);

    try {
      const result = await this.callOpenAIWithCircuitBreaker(
        engineeredPrompt,
        job,
      );

      await job.updateProgress(80);

      // Store result and notify user
      const savedUrl = await this.storageService.upload(result.imageBuffer);
      await this.notifyUser(userId, job.id, savedUrl);

      await job.updateProgress(100);

      return { imageUrl: savedUrl, generatedAt: Date.now() };
    } catch (error) {
      // Classify the error for retry decisions
      if (this.isRetryable(error)) {
        throw error; // BullMQ will retry based on config
      }
      // Non-retryable: log and fail permanently
      await this.notifyUserFailure(userId, job.id, error.message);
      throw new UnrecoverableError(error.message);
    }
  }

  private isRetryable(error: any): boolean {
    if (error.status === 429) return true;  // Rate limited
    if (error.status === 500) return true;  // Server error
    if (error.status === 503) return true;  // Service unavailable
    if (error.code === 'ETIMEDOUT') return true;
    return false;
  }
}

Three important patterns here:

  1. Concurrency control: concurrency: 3 means only 3 images generate simultaneously. This prevents both OOM and API abuse.
  2. Rate limiter: BullMQ's built-in limiter caps throughput at 10 jobs/minute, staying well within OpenAI's limits.
  3. Error classification: Not all errors deserve a retry. A 400 (bad prompt) will never succeed on retry. UnrecoverableError tells BullMQ to fail immediately.

Step 4: The Circuit Breaker

This is the piece most tutorials skip. When OpenAI goes down (and it does), you don't want 500 jobs hammering a dead endpoint.

import CircuitBreaker from 'opossum';

private createCircuitBreaker() {
  this.breaker = new CircuitBreaker(
    async (prompt: string) => {
      return this.openai.images.generate({
        model: 'dall-e-3',
        prompt,
        size: '1024x1024',
        response_format: 'b64_json',
      });
    },
    {
      timeout: 90000,        // 90s before we consider it failed
      errorThresholdPercentage: 50,  // Open circuit after 50% failures
      resetTimeout: 30000,   // Try again after 30s
      volumeThreshold: 5,    // Need at least 5 requests before tripping
    },
  );

  this.breaker.on('open', () => {
    this.logger.warn('Circuit breaker OPEN — OpenAI appears down');
    this.metricsService.increment('circuit_breaker.open');
  });

  this.breaker.on('halfOpen', () => {
    this.logger.log('Circuit breaker HALF-OPEN — testing recovery');
  });

  this.breaker.on('close', () => {
    this.logger.log('Circuit breaker CLOSED — OpenAI recovered');
  });
}

When the circuit opens, jobs stay in the queue instead of burning through retries. Once OpenAI recovers, the circuit closes and processing resumes automatically.
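To make the open → half-open → closed cycle concrete, here is a stripped-down sketch of the state machine opossum implements. This is a teaching toy under the same option names as above, not a replacement for the library:

```typescript
// Minimal circuit-breaker state machine — a sketch of what opossum
// does internally; use the library itself in production.
type BreakerState = 'closed' | 'open' | 'half-open';

class ToyBreaker {
  state: BreakerState = 'closed';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(
    private readonly volumeThreshold: number,   // min calls before tripping
    private readonly errorThresholdPct: number, // % failures that opens it
    private readonly resetTimeoutMs: number,    // how long to stay open
  ) {}

  canFire(now: number): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open'; // allow one probe request through
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    if (this.state === 'half-open') {
      // Probe succeeded: close the circuit and reset the counters
      this.state = 'closed';
      this.failures = 0;
      this.successes = 0;
      return;
    }
    this.successes++;
  }

  recordFailure(now: number): void {
    if (this.state === 'half-open') {
      this.state = 'open'; // probe failed: re-open immediately
      this.openedAt = now;
      return;
    }
    this.failures++;
    const total = this.failures + this.successes;
    const pct = (this.failures / total) * 100;
    if (total >= this.volumeThreshold && pct >= this.errorThresholdPct) {
      this.state = 'open';
      this.openedAt = now;
    }
  }
}
```

With the settings from the config above (volumeThreshold 5, 50% error threshold, 30s reset), five straight failures open the circuit, nothing fires for 30 seconds, then a single probe decides whether to close it or re-open it.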

Step 5: Real-Time Status Updates

Users need to know what's happening. We use Server-Sent Events for this:

@Sse('status/:jobId')
streamStatus(@Param('jobId') jobId: string): Observable<MessageEvent> {
  return new Observable((subscriber) => {
    const checkInterval = setInterval(async () => {
      const job = await this.generationQueue.getJob(jobId);
      if (!job) {
        subscriber.complete();
        clearInterval(checkInterval);
        return;
      }

      const state = await job.getState();
      const progress = job.progress;

      subscriber.next({
        data: { state, progress, result: job.returnvalue },
      } as MessageEvent);

      if (state === 'completed' || state === 'failed') {
        subscriber.complete();
        clearInterval(checkInterval);
      }
    }, 2000);

    return () => clearInterval(checkInterval);
  });
}

The frontend connects to this SSE endpoint and shows real-time progress: queued → active (10% → 20% → 80% → 100%) → completed.

The Numbers

After deploying this architecture for our AI interior design platform, here's what changed:

| Metric | Before (naive) | After (queue-based) |
| --- | --- | --- |
| Concurrent users supported | 3 | 200+ |
| Failed generations (lost) | ~15% | <0.5% |
| Average response time (initial) | 30-60s (blocking) | <100ms |
| OpenAI rate limit hits (user-facing) | Daily | Zero |
| Monthly cost waste from failed retries | ~$200 | ~$8 |

The API response time dropped from 30-60 seconds of blocking to under 100ms. Users see immediate feedback. Failed generations dropped from 15% to under 0.5% because retries actually work now.

Lessons Learned

1. Treat AI APIs like unreliable external services, because they are. They're closer to payment gateways than database queries. Design accordingly.

2. Backpressure is your friend. Without concurrency limits, a traffic spike will eat your entire OpenAI budget in minutes. BullMQ's limiter prevents this naturally.

3. Progress updates matter more than speed. Users tolerate 45-second waits when they see a progress bar. They abandon after 10 seconds of a blank spinner.

4. UnrecoverableError saves money. Don't retry bad prompts. Classify errors aggressively.

5. Monitor the queue, not just the API. Queue depth, processing time, and failure rate tell you more about system health than HTTP status codes.
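A queue-level health check can be a plain function over the counts BullMQ already exposes (getWaitingCount, getActiveCount, getFailedCount). The thresholds below are illustrative examples, not recommendations — tune them to your own traffic:

```typescript
// Illustrative queue-health check built on counts BullMQ exposes.
// Thresholds are example values, not recommendations.
interface QueueSnapshot {
  waiting: number;
  active: number;
  failedLastHour: number;
  completedLastHour: number;
}

function queueHealth(s: QueueSnapshot): 'healthy' | 'degraded' | 'critical' {
  const finished = s.failedLastHour + s.completedLastHour;
  const failureRate = finished === 0 ? 0 : s.failedLastHour / finished;

  // A deep backlog or high failure rate signals trouble long before
  // any individual HTTP status code does.
  if (failureRate > 0.25 || s.waiting > 500) return 'critical';
  if (failureRate > 0.05 || s.waiting > 100) return 'degraded';
  return 'healthy';
}
```

Expose something like this on a /health endpoint and alert on it, rather than on individual OpenAI call failures.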

Quick Start

If you want to implement this pattern in your own project:

npm install @nestjs/bullmq bullmq opossum
npm install -D @types/opossum

The three files you need: a module registering the queue, a controller that enqueues and returns fast, and a worker with retry logic and circuit breaking. Start with concurrency: 1 and scale up based on your OpenAI tier limits.

Wrapping Up

The gap between "AI demo" and "AI product" is mostly infrastructure. The AI part (calling OpenAI) is the easy part. Making it reliable, observable, and cost-efficient at scale is where the real engineering lives.

We built this pattern while developing an AI-powered interior design SaaS at Gerus-lab, where users generate room redesigns using AI inpainting. The architecture has been running in production handling thousands of generations with near-zero lost jobs.

If you're building something similar, steal this pattern. It works.


Check out our work at gerus-lab.com
