Everyone's building AI-powered apps in 2026. Few talk about what happens when your "call OpenAI and return the result" approach meets real users.
We built an AI interior design SaaS that generates room redesigns using OpenAI's image APIs. In development, everything worked. In production with real users hitting the generate button, everything broke.
Here's what went wrong and the architecture we built to fix it.
## The Problem: AI APIs Are Not REST APIs
When you build a typical CRUD app, your API calls take 50-200ms. You call a database, get a response, return it. Simple.
AI image generation is different:
- Latency: 15-60 seconds per request
- Rate limits: OpenAI enforces strict RPM and TPM limits
- Failures: Network timeouts, 429s, 500s are routine, not exceptional
- Cost: Each failed request that gets retried costs real money
- Concurrency: 50 users hitting "Generate" simultaneously will destroy your throughput
We learned this the hard way. Our first production deployment handled exactly 3 concurrent users before falling over.
## The Naive Approach (What We Started With)
```typescript
// Don't do this in production
@Post('generate')
async generateDesign(@Body() dto: GenerateDto) {
  const result = await this.openai.images.generate({
    model: 'dall-e-3',
    prompt: dto.prompt,
    size: '1024x1024',
  });
  return { imageUrl: result.data[0].url };
}
```
This code has at least five production-killing problems:
- Request timeout: Most load balancers kill connections after 30s. Your 45s image generation dies silently.
- No retry logic: A transient 429 means a lost generation for the user.
- Memory pressure: Each pending request holds a connection and memory. 50 concurrent = OOM.
- No rate limit awareness: You'll burn through your OpenAI quota in minutes.
- No observability: When it breaks, you have no idea why.
## The Architecture That Actually Works
After iterating through three major rewrites, here's what we landed on:
```
User Request → API → BullMQ Queue → Worker Pool → OpenAI API
                ↓                        ↓
        Job ID returned      Result → Redis → Webhook/SSE
```
The key insight: decouple the request from the execution. The user gets an immediate response (a job ID), and the actual AI work happens asynchronously in a controlled worker pool.
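To make the decoupling concrete, here is a toy in-memory stand-in (deliberately not BullMQ; every name in it is illustrative): the enqueue call returns an id immediately, and a separate worker loop does the slow part off the request path.

```typescript
type JobState = 'queued' | 'active' | 'completed' | 'failed';

interface QueuedJob<T, R> {
  id: string;
  data: T;
  state: JobState;
  result?: R;
}

class InMemoryQueue<T, R> {
  private jobs = new Map<string, QueuedJob<T, R>>();
  private pending: string[] = [];
  private seq = 0;

  // The HTTP handler calls this and returns immediately with the job id.
  enqueue(data: T): string {
    const id = `job-${++this.seq}`;
    this.jobs.set(id, { id, data, state: 'queued' });
    this.pending.push(id);
    return id;
  }

  // A worker loop calls this; the slow AI call happens here, not in the handler.
  async processNext(handler: (data: T) => Promise<R>): Promise<void> {
    const id = this.pending.shift();
    if (!id) return;
    const job = this.jobs.get(id)!;
    job.state = 'active';
    try {
      job.result = await handler(job.data);
      job.state = 'completed';
    } catch {
      job.state = 'failed';
    }
  }

  // Status endpoints poll this by id instead of holding a connection open.
  status(id: string): QueuedJob<T, R> | undefined {
    return this.jobs.get(id);
  }
}
```

BullMQ adds persistence, retries, and multi-process workers on top of exactly this shape, which is why the rest of the post swaps the toy out for it.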
### Step 1: The Queue Layer
```typescript
// generation.module.ts
import { Module } from '@nestjs/common';
import { BullModule } from '@nestjs/bullmq';

@Module({
  imports: [
    BullModule.registerQueue({
      name: 'image-generation',
      defaultJobOptions: {
        attempts: 3,
        backoff: {
          type: 'exponential',
          delay: 5000,
        },
        removeOnComplete: { age: 3600 },
        removeOnFail: { age: 86400 },
      },
    }),
  ],
})
export class GenerationModule {}
```
The `defaultJobOptions` here are critical:
- 3 attempts with exponential backoff: A job that fails at t=0 retries at about t=5s, then about t=15s (BullMQ doubles the base delay on each retry). By the final attempt, the rate limit window has usually reset.
- removeOnComplete after 1 hour: Don't let Redis fill up with completed jobs.
- removeOnFail after 24 hours: Keep failed jobs around long enough to debug.
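Concretely, BullMQ's built-in `exponential` strategy computes each retry delay as `delay * 2^(attemptsMade - 1)`. A quick sketch of the spacing (the function name is ours):

```typescript
// Retry delay for BullMQ's exponential backoff strategy:
// delay * 2^(attemptsMade - 1), where attemptsMade counts attempts so far.
function backoffDelayMs(baseDelayMs: number, attemptsMade: number): number {
  return baseDelayMs * Math.pow(2, attemptsMade - 1);
}

// With delay: 5000 — first retry waits 5000 ms, second waits 10000 ms,
// and with attempts: 3 the job then fails for good.
```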
### Step 2: The Controller (Fast Response)
```typescript
@Post('generate')
async generateDesign(@Body() dto: GenerateDto, @Req() req) {
  const job = await this.generationQueue.add('generate-image', {
    userId: req.user.id,
    prompt: dto.prompt,
    style: dto.style,
    roomType: dto.roomType,
    createdAt: Date.now(),
  }, {
    priority: this.getUserPriority(req.user),
    jobId: `gen-${req.user.id}-${Date.now()}`,
  });

  return {
    jobId: job.id,
    status: 'queued',
    estimatedWaitSeconds: await this.estimateWait(),
  };
}

private async estimateWait(): Promise<number> {
  const waiting = await this.generationQueue.getWaitingCount();
  const active = await this.generationQueue.getActiveCount();
  // Each job takes ~30s on average and we run 3 workers;
  // count every job ahead of this one, including in-flight ones
  return Math.ceil(((waiting + active) / 3) * 30);
}
```
The user gets a response in <100ms. They get a job ID and an estimated wait time. The frontend can now show a progress indicator instead of a spinning wheel that might die.
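The estimate itself boils down to simple division. Factored as a pure function (parameter names are ours; the 3 workers and ~30s average are the same hard-coded assumptions as in the controller):

```typescript
// Rough wait estimate: jobs ahead of you, divided across workers,
// times the average job duration. Rounded up to whole seconds.
function estimateWaitSeconds(
  jobsAhead: number,     // waiting (and in-flight) jobs in the queue
  workerCount: number,   // concurrent workers draining it
  avgJobSeconds: number, // average generation time
): number {
  return Math.ceil((jobsAhead / workerCount) * avgJobSeconds);
}
```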
### Step 3: The Worker (Where the Magic Happens)
```typescript
import { Processor, WorkerHost } from '@nestjs/bullmq';
import { Job, UnrecoverableError } from 'bullmq';

@Processor('image-generation', {
  concurrency: 3,
  limiter: {
    max: 10,
    duration: 60000, // 10 jobs per minute max
  },
})
export class GenerationWorker extends WorkerHost {
  async process(job: Job<GenerationPayload>): Promise<GenerationResult> {
    const { userId, prompt, style, roomType } = job.data;
    await job.updateProgress(10);

    // Build the prompt with guardrails
    const engineeredPrompt = this.buildPrompt(prompt, style, roomType);
    await job.updateProgress(20);

    try {
      const result = await this.callOpenAIWithCircuitBreaker(
        engineeredPrompt,
        job,
      );
      await job.updateProgress(80);

      // Store result and notify user
      const savedUrl = await this.storageService.upload(result.imageBuffer);
      await this.notifyUser(userId, job.id, savedUrl);
      await job.updateProgress(100);

      return { imageUrl: savedUrl, generatedAt: Date.now() };
    } catch (error) {
      // Classify the error for retry decisions
      if (this.isRetryable(error)) {
        throw error; // BullMQ will retry based on config
      }
      // Non-retryable: log and fail permanently
      await this.notifyUserFailure(userId, job.id, error.message);
      throw new UnrecoverableError(error.message);
    }
  }

  private isRetryable(error: any): boolean {
    if (error.status === 429) return true; // Rate limited
    if (error.status === 500) return true; // Server error
    if (error.status === 503) return true; // Service unavailable
    if (error.code === 'ETIMEDOUT') return true;
    return false;
  }
}
```
Three important patterns here:
1. Concurrency control: `concurrency: 3` means only 3 images generate simultaneously. This prevents both OOM and API abuse.
2. Rate limiter: BullMQ's built-in limiter caps throughput at 10 jobs/minute, staying well within OpenAI's limits.
3. Error classification: Not all errors deserve a retry. A 400 (bad prompt) will never succeed on retry. `UnrecoverableError` tells BullMQ to fail immediately.
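If you want the classification logic reusable outside the worker class, it can live as a standalone helper. This sketch mirrors the worker's status-code checks; the function name is ours:

```typescript
// Decide whether a failed OpenAI call is worth retrying.
// Transient conditions (rate limit, server errors, timeouts) retry;
// everything else (e.g. a 400 for a bad prompt) fails immediately.
function isRetryableError(error: { status?: number; code?: string }): boolean {
  if (error.status === 429) return true;       // rate limited: back off and retry
  if (error.status === 500) return true;       // transient server error
  if (error.status === 503) return true;       // service unavailable
  if (error.code === 'ETIMEDOUT') return true; // network timeout
  return false;                                // retrying won't help
}
```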
### Step 4: The Circuit Breaker
This is the piece most tutorials skip. When OpenAI goes down (and it does), you don't want 500 jobs hammering a dead endpoint.
```typescript
import CircuitBreaker from 'opossum';

private createCircuitBreaker() {
  this.breaker = new CircuitBreaker(
    async (prompt: string) => {
      return this.openai.images.generate({
        model: 'dall-e-3',
        prompt,
        size: '1024x1024',
        response_format: 'b64_json',
      });
    },
    {
      timeout: 90000, // 90s before we consider it failed
      errorThresholdPercentage: 50, // Open circuit after 50% failures
      resetTimeout: 30000, // Try again after 30s
      volumeThreshold: 5, // Need at least 5 requests before tripping
    },
  );

  this.breaker.on('open', () => {
    this.logger.warn('Circuit breaker OPEN — OpenAI appears down');
    this.metricsService.increment('circuit_breaker.open');
  });
  this.breaker.on('halfOpen', () => {
    this.logger.log('Circuit breaker HALF-OPEN — testing recovery');
  });
  this.breaker.on('close', () => {
    this.logger.log('Circuit breaker CLOSED — OpenAI recovered');
  });
}
```
When the circuit opens, jobs stay in the queue instead of burning through retries. Once OpenAI recovers, the circuit closes and processing resumes automatically.
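Under the hood, opossum implements a closed → open → half-open state machine. Here is a stripped-down sketch of that logic (our own toy, not opossum's implementation: it trips on consecutive failures rather than a rolling error percentage, and takes an injectable clock so it can be tested deterministically):

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class MiniBreaker {
  state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number, // consecutive failures before opening
    private resetTimeoutMs: number,   // how long the circuit stays open
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  // Callers check this before hitting the upstream API.
  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open'; // let one probe request through
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed'; // a successful probe closes the circuit
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe, or too many consecutive failures, (re)opens the circuit.
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }
}
```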
### Step 5: Real-Time Status Updates
Users need to know what's happening. We use Server-Sent Events for this:
```typescript
@Sse('status/:jobId')
streamStatus(@Param('jobId') jobId: string): Observable<MessageEvent> {
  // Note: an @Sse handler must return the Observable synchronously, not a Promise
  return new Observable((subscriber) => {
    const checkInterval = setInterval(async () => {
      const job = await this.generationQueue.getJob(jobId);
      if (!job) {
        subscriber.complete();
        clearInterval(checkInterval);
        return;
      }

      const state = await job.getState();
      const progress = job.progress;
      subscriber.next({
        data: { state, progress, result: job.returnvalue },
      } as MessageEvent);

      if (state === 'completed' || state === 'failed') {
        subscriber.complete();
        clearInterval(checkInterval);
      }
    }, 2000);

    return () => clearInterval(checkInterval);
  });
}
```
The frontend connects to this SSE endpoint and shows real-time progress: queued → active (10% → 20% → 80% → 100%) → completed.
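On the client, consuming the endpoint takes only a few lines. A hypothetical sketch (the `/status/:jobId` path and the payload shape follow the server code above; the label strings and function names are our own):

```typescript
// EventSource is a browser API; grab it off globalThis so this sketch
// also type-checks under Node (where it will simply be undefined).
const EventSourceCtor: any = (globalThis as any).EventSource;

// Map BullMQ job states to user-facing copy (labels are our own choice).
function describeStatus(state: string, progress: unknown): string {
  if (state === 'completed') return 'Done';
  if (state === 'failed') return 'Generation failed';
  if (state === 'active') return `Generating (${progress}%)`;
  return 'Queued'; // waiting / delayed / anything else
}

// Subscribe to job updates; returns a cleanup function for component unmount.
function watchJob(jobId: string, onUpdate: (label: string) => void): () => void {
  const source = new EventSourceCtor(`/status/${jobId}`);
  source.onmessage = (event: { data: string }) => {
    const { state, progress } = JSON.parse(event.data);
    onUpdate(describeStatus(state, progress));
    if (state === 'completed' || state === 'failed') source.close();
  };
  return () => source.close();
}
```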
## The Numbers
After deploying this architecture for our AI interior design platform, here's what changed:
| Metric | Before (naive) | After (queue-based) |
|---|---|---|
| Concurrent users supported | 3 | 200+ |
| Failed generations (lost) | ~15% | <0.5% |
| Average response time (initial) | 30-60s (blocking) | <100ms |
| OpenAI rate limit hits (user-facing) | Daily | Zero |
| Monthly cost waste from failed retries | ~$200 | ~$8 |
The API response time dropped from 30-60 seconds of blocking to under 100ms. Users see immediate feedback. Failed generations dropped from 15% to under 0.5% because retries actually work now.
## Lessons Learned
1. Treat AI APIs like unreliable external services, because they are. They're closer to payment gateways than database queries. Design accordingly.
2. Backpressure is your friend. Without concurrency limits, a traffic spike will eat your entire OpenAI budget in minutes. BullMQ's limiter prevents this naturally.
3. Progress updates matter more than speed. Users tolerate 45-second waits when they see a progress bar. They abandon after 10 seconds of a blank spinner.
4. `UnrecoverableError` saves money. Don't retry bad prompts. Classify errors aggressively.
5. Monitor the queue, not just the API. Queue depth, processing time, and failure rate tell you more about system health than HTTP status codes.
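A minimal version of that queue-level monitoring, sketched against BullMQ's count methods (`getWaitingCount`, `getActiveCount`, and `getFailedCount` are real Queue methods; the health thresholds here are illustrative, not from our production config):

```typescript
interface QueueHealth {
  waiting: number;
  active: number;
  failed: number;
  healthy: boolean;
}

// Snapshot queue health from the counts BullMQ already tracks.
// Typed structurally so it accepts a real BullMQ Queue or a stub in tests.
async function queueHealth(queue: {
  getWaitingCount(): Promise<number>;
  getActiveCount(): Promise<number>;
  getFailedCount(): Promise<number>;
}): Promise<QueueHealth> {
  const [waiting, active, failed] = await Promise.all([
    queue.getWaitingCount(),
    queue.getActiveCount(),
    queue.getFailedCount(),
  ]);
  // Thresholds are illustrative; tune them to your worker count and SLOs.
  return { waiting, active, failed, healthy: waiting < 100 && failed < 20 };
}
```

Poll this on an interval and alert on `healthy: false` — queue depth climbing while active stays flat is the earliest sign the upstream API is degrading.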
## Quick Start
If you want to implement this pattern in your own project:
```bash
npm install @nestjs/bullmq bullmq opossum
npm install -D @types/opossum
```
The three files you need: a module registering the queue, a controller that enqueues and returns fast, and a worker with retry logic and circuit breaking. Start with `concurrency: 1` and scale up based on your OpenAI tier limits.
## Wrapping Up
The gap between "AI demo" and "AI product" is mostly infrastructure. The AI part (calling OpenAI) is the easy part. Making it reliable, observable, and cost-efficient at scale is where the real engineering lives.
We built this pattern while developing an AI-powered interior design SaaS at Gerus-lab, where users generate room redesigns using AI inpainting. The architecture has been running in production handling thousands of generations with near-zero lost jobs.
If you're building something similar, steal this pattern. It works.
Check out our work at gerus-lab.com