DEV Community

horus he
From Prompt to Production: A Practical Architecture for Chat-Based AI Image Generation

When we started building Banana AI, we faced a familiar challenge. Users wanted to generate images by describing what they needed, but most existing AI image tools required learning prompt syntax, navigating complex interfaces, or tolerating long waits with no feedback.

The solution we built centers on a chat-based interface where users describe their needs in plain language, refine through conversation, and receive production-ready images. This approach works well for our users, but getting there required solving several architectural problems around async processing, state management, and cost optimization.

This article walks through the practical decisions we made and the patterns that have held up in production for our AI image generator.


The Core Problem: AI Generation Is Slow

AI image generation takes anywhere from 2 to 20 seconds depending on the model and resolution. That's too long for a synchronous HTTP request. Users expect responsive interfaces, and holding a connection open for 20 seconds creates poor experiences and infrastructure problems.

The standard solution is async processing: accept a request, queue it, process it in the background, and notify the user when complete. But implementing this well requires thinking through several connected systems.

Architecture Overview

Our stack runs on Cloudflare's edge infrastructure:

  • Next.js 15 with App Router for the frontend
  • Cloudflare Workers for serverless compute
  • Cloudflare D1 for relational data (sessions, messages, user state)
  • Cloudflare Durable Objects for real-time prediction state
  • Cloudflare R2 for image storage
  • Cloudflare Queues for async job processing

The flow looks like this:

  1. User sends a message through the chat interface
  2. Frontend posts to our API as FormData (supports image uploads)
  3. API authenticates, creates or resumes a session, and queues the generation job
  4. Worker picks up the job, calls the AI model API, and streams progress events
  5. Frontend receives real-time updates via polling
  6. When complete, the image uploads to R2 and the user sees the result

This architecture keeps the frontend responsive while handling long-running AI tasks in the background.
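The accept-and-queue step can be sketched as follows. This is a simplified, self-contained model, not our actual handler: the in-memory array stands in for the Cloudflare Queues producer binding, and the ID format is illustrative.

```typescript
// Shape of a job handed to the background worker.
interface GenerationJob {
  predictionId: string;
  sessionId: string;
  prompt: string;
}

// In-memory stand-in for the queue producer binding.
const queue: GenerationJob[] = [];

// Accept the request, enqueue the work, and return immediately with an
// ID the client can poll. This gives 202 Accepted semantics: the HTTP
// response never waits on the AI model.
function handleGenerateRequest(
  sessionId: string,
  prompt: string
): { predictionId: string; status: string } {
  if (!prompt.trim()) {
    throw new Error('Prompt must not be empty');
  }
  const predictionId = `pred_${Math.random().toString(36).slice(2, 10)}`;
  queue.push({ predictionId, sessionId, prompt });
  return { predictionId, status: 'pending' };
}
```

The client then polls a status endpoint with the returned `predictionId` while the worker drains the queue in the background.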


State Management: Sessions and Messages

Chat-based generation differs from single-prompt tools because context matters. A user might say "make it darker," and the system needs to understand what "it" refers to.

We model this with two core entities:

Sessions

A session represents a conversation. It tracks:

  • User ID (if authenticated)
  • Current model selection (users can switch models mid-conversation)
  • Creation timestamp and last activity
  • Soft delete flag for user-initiated cleanup

// Simplified schema (Drizzle ORM)
import { sqliteTable, text, integer } from 'drizzle-orm/sqlite-core';

const chatSessions = sqliteTable('chat_sessions', {
  uuid: text('uuid').primaryKey(),
  userId: text('user_id').notNull(),
  modelId: text('model_id').notNull(),
  createdAt: integer('created_at').notNull(),
  updatedAt: integer('updated_at').notNull(),
  deletedAt: integer('deleted_at'),
});

Messages

Each message belongs to a session and carries:

  • Role: user or assistant
  • Content: text, plus optional media attachments
  • Generation parameters: model used, aspect ratio, resolution
  • Status: pending, processing, completed, failed

The message history provides context for future generations. When a user says "make the background blue," we include the relevant conversation context when calling the AI model.
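A minimal sketch of that context assembly, assuming a message shape like the one described above (the field names and the six-message window are illustrative):

```typescript
// Simplified message shape mirroring the entities described above.
interface Message {
  role: 'user' | 'assistant';
  content: string;
  status: 'pending' | 'processing' | 'completed' | 'failed';
}

// Keep only the most recent completed exchanges, so a follow-up like
// "make the background blue" resolves against the image the user is
// actually talking about, without sending the whole history.
function buildContext(history: Message[], maxMessages = 6): Message[] {
  return history
    .filter((m) => m.status === 'completed')
    .slice(-maxMessages);
}
```

Filtering out failed and in-flight messages matters here: including a failed generation in the context would make the model "refine" an image the user never saw.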


Credit System: Reservations and Confirmations

We use a credit-based pricing model. Different models cost different amounts (Nano Banana costs 5 credits, Nano Banana Pro costs 10-20 credits depending on resolution).

The challenge is handling credits atomically when AI generation can fail at multiple points:

  1. User submits request
  2. We reserve credits (deduct from balance but mark as "pending")
  3. AI generation runs
  4. If successful: confirm the reservation (credits become spent)
  5. If failed: release the reservation (credits return to balance)

This reservation pattern prevents the worst case: charging users for failed generations.

We implement this with Durable Objects, which provide strongly consistent state. Each prediction gets its own Durable Object instance that tracks:

interface PredictionState {
  predictionId: string;
  userId: string;
  creditsReserved: number;
  status: 'pending' | 'processing' | 'succeeded' | 'failed';
  createdAt: number;
  completedAt?: number;
  resultUrl?: string;
  error?: string;
}

The Durable Object handles the state transitions and ensures credits are either confirmed or released exactly once.
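The exactly-once guarantee can be modeled in plain TypeScript. This is an in-memory sketch, not Durable Object code: a real Durable Object serializes access per instance, so the `settled` flag check below stands in for that single-threaded guarantee.

```typescript
type Status = 'pending' | 'processing' | 'succeeded' | 'failed';

// In-memory model of the per-prediction state machine described above.
class PredictionRecord {
  status: Status = 'pending';
  creditsReserved: number;
  private settled = false; // true once credits were confirmed or released

  constructor(creditsReserved: number) {
    this.creditsReserved = creditsReserved;
  }

  // Returns the credit action to take, or null if already settled.
  // Calling this twice (e.g. duplicate webhooks) is a safe no-op.
  settle(outcome: 'succeeded' | 'failed'): 'confirm' | 'release' | null {
    if (this.settled) return null; // exactly-once guard
    this.settled = true;
    this.status = outcome;
    return outcome === 'succeeded' ? 'confirm' : 'release';
  }
}
```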


Streaming Progress: Keeping Users Informed

One advantage of chat-based interfaces is that users expect conversational updates. We leverage this by streaming progress events during generation:

  • understanding: Processing the user's request
  • refining_prompt: Preparing the optimized prompt
  • generating: AI model is creating the image
  • uploading: Saving the result to storage
  • finalizing: Updating session state

The frontend polls a /prediction/[id] endpoint that returns the current state from the Durable Object. This creates a simple but effective progress indicator.

For longer generations (Nano Banana Pro includes a "thinking" stage that analyzes composition before generating), this feedback is essential. Users can see that progress is happening rather than staring at a static loading spinner.
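On the frontend, the stage list above maps naturally onto a progress bar. A small sketch (the user-facing copy is invented for illustration):

```typescript
// Ordered pipeline stages, matching the progress events listed above.
const STAGES = [
  'understanding',
  'refining_prompt',
  'generating',
  'uploading',
  'finalizing',
] as const;
type Stage = (typeof STAGES)[number];

// Illustrative status text shown while each stage runs.
const STAGE_LABELS: Record<Stage, string> = {
  understanding: 'Understanding your request...',
  refining_prompt: 'Refining the prompt...',
  generating: 'Generating your image...',
  uploading: 'Saving the result...',
  finalizing: 'Almost done...',
};

// Progress as a fraction, so the UI renders a moving bar, not a spinner.
function progressFor(stage: Stage): number {
  return (STAGES.indexOf(stage) + 1) / STAGES.length;
}
```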


Model Selection: Flexibility vs. Complexity

We support three AI models with different characteristics:

Model            Speed   Resolution  Cost           Best For
Nano Banana      2-5s    1K          5 credits      Quick drafts
Nano Banana 2    4-6s    Up to 4K    7-14 credits   Balanced work
Nano Banana Pro  10-20s  Up to 4K    10-20 credits  Maximum quality

Users can switch models mid-conversation. This creates flexibility (draft with a fast model, refine with a quality model) but adds complexity to the state management.

We store model selection at the session level, but each message records which model actually produced it. This means a single conversation can include generations from multiple models, and users can see which model produced each result.


Webhook Handling: External Service Integration

We use Replicate as one of our AI model providers. Replicate processes images asynchronously and sends a webhook when complete.

Handling webhooks reliably requires addressing several edge cases:

Idempotency

Webhooks can arrive multiple times. Our handler checks if we've already processed this prediction ID before taking action:

// Simplified handler
export async function POST(request: Request) {
  const { id, status, output } = await request.json();

  // Get the Durable Object for this prediction
  const stub = getPredictionDO(id);
  const state = await stub.getState();

  // Already processed?
  if (state.status !== 'pending' && state.status !== 'processing') {
    return new Response('Already processed', { status: 200 });
  }

  // Update state and confirm or release credits
  if (status === 'succeeded') {
    await stub.markCompleted(output);
    await confirmCredits(state.userId, state.creditsReserved);
  } else {
    await stub.markFailed();
    await releaseCredits(state.userId, state.creditsReserved);
  }

  return new Response('OK', { status: 200 });
}

Timeout Handling

Sometimes webhooks never arrive. We implement a timeout mechanism that checks for stale predictions and either retries or releases reserved credits.
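The sweep itself is simple. A sketch, assuming the prediction shape from earlier; the 5-minute threshold is an illustrative number, not our production value:

```typescript
interface PendingPrediction {
  predictionId: string;
  status: 'pending' | 'processing' | 'succeeded' | 'failed';
  createdAt: number; // epoch ms
}

const STALE_AFTER_MS = 5 * 60 * 1000; // illustrative; tune per model

// Find predictions that have been in flight too long, so their reserved
// credits can be released (or the job retried) by a scheduled task.
function findStale(
  predictions: PendingPrediction[],
  now: number
): PendingPrediction[] {
  return predictions.filter(
    (p) =>
      (p.status === 'pending' || p.status === 'processing') &&
      now - p.createdAt > STALE_AFTER_MS
  );
}
```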

State Recovery

If our service restarts mid-generation, we need to recover. Durable Objects persist their state, so when the service comes back online, we can query for predictions that were in-progress and either resume waiting or initiate recovery.


Cost Optimization: Learning from Production Data

After running this system in production, we've identified several patterns that reduce costs:

Model Routing

Simple prompts often don't need the highest-quality model. We've experimented with automatic model selection based on prompt complexity, though currently we let users choose explicitly. The data suggests we could route 30-40% of requests to cheaper models without noticeable quality impact.
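A routing heuristic along those lines might look like this. To be clear, the word-count threshold and keyword list are invented for illustration; they are not the rules our experiments used.

```typescript
type Model = 'nano-banana' | 'nano-banana-2' | 'nano-banana-pro';

// Hypothetical signals that a prompt wants maximum quality.
const DETAIL_KEYWORDS = ['photorealistic', '4k', 'intricate', 'detailed', 'cinematic'];

// Route simple prompts to the cheap model, detailed ones upward.
function routeModel(prompt: string): Model {
  const words = prompt.trim().split(/\s+/).length;
  const wantsDetail = DETAIL_KEYWORDS.some((k) =>
    prompt.toLowerCase().includes(k)
  );
  if (wantsDetail) return 'nano-banana-pro';
  if (words > 25) return 'nano-banana-2';
  return 'nano-banana';
}
```

The appeal of a heuristic like this is that it fails safe: a misrouted prompt still produces an image, just at a different quality/cost point, and the user can always override the choice.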

Resolution Defaults

Most users don't need 4K output. We default to 2K, which costs fewer credits and generates faster. Users can upgrade resolution for specific images where it matters.

Caching Similar Requests

Users often regenerate with minor variations ("same but with different text"). We've explored caching intermediate representations, though the complexity hasn't been worth it for our scale yet.


What We'd Do Differently

Building this system taught us a few lessons:

Polling vs. WebSockets

We chose polling for simplicity. It works adequately, but for real-time collaboration features, WebSockets would provide a better experience. The infrastructure complexity is the trade-off.

Queue Visibility

Cloudflare Queues are reliable but debugging failed jobs can be challenging. We added extensive logging to the queue consumer, which helps but isn't as nice as a proper dead-letter queue interface.

Rate Limiting at the Edge

We implement rate limiting with Durable Objects, which works but adds latency to every request. A dedicated rate-limiting service at the edge would be cleaner.
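For reference, the core of a per-user limiter like the one a Durable Object might hold is small. This is a fixed-window sketch with illustrative limits, not our production implementation:

```typescript
// Fixed-window rate limiter; one instance per user.
class RateLimiter {
  private windowStart = 0;
  private count = 0;

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the request is allowed in the current window.
  allow(now: number): boolean {
    if (now - this.windowStart >= this.windowMs) {
      // New window: reset the counter.
      this.windowStart = now;
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }
}
```

The latency cost mentioned above comes from the round trip to the Durable Object on every request, not from this logic, which is trivial.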


Conclusion

Chat-based AI image generation requires thinking beyond the AI model itself. The surrounding architecture (async processing, state management, credit handling, progress streaming) determines whether the user experience feels smooth or frustrating.

Our approach prioritizes:

  1. Responsiveness: The UI never blocks on long-running AI tasks
  2. Transparency: Users see progress and understand what's happening
  3. Reliability: Credits are handled atomically; failures don't charge users
  4. Flexibility: Users can switch models and refine through conversation

This architecture has handled thousands of generations with consistent uptime. You can see it in action at Banana AI.

If you're building similar systems, I'm interested in hearing about your architectural decisions. What trade-offs have you made between simplicity and features? How do you handle the async nature of AI generation?


Technical Stack Summary

  • Framework: Next.js 15 with App Router
  • Runtime: Cloudflare Workers via OpenNext
  • Database: Cloudflare D1 (SQLite at the edge)
  • State: Cloudflare Durable Objects
  • Storage: Cloudflare R2
  • Queue: Cloudflare Queues
  • AI Providers: Google Gemini models via Replicate
  • Authentication: NextAuth v5 with Google OAuth

Tags: #javascript #typescript #ai #cloudflare #architecture #webdev
