DEV Community

horus he
From Prompt to Production: A Practical Architecture for Chat-Based AI Image Generation

When we started building Banana AI, we faced a familiar challenge. Users wanted to generate images by describing what they needed, but most existing AI image tools required learning prompt syntax, navigating complex interfaces, or tolerating long waits with no feedback.

The solution we built centers on a chat-based interface where users describe their needs in plain language, refine through conversation, and receive production-ready images. This approach works well for our users, but getting there required solving several architectural problems around async processing, state management, and cost optimization.

This article walks through the practical decisions we made and the patterns that have held up in production for our AI image generator.


The Core Problem: AI Generation Is Slow

AI image generation takes anywhere from 2 to 20 seconds depending on the model and resolution. That's too long for a synchronous HTTP request. Users expect responsive interfaces, and holding a connection open for 20 seconds creates poor experiences and infrastructure problems.

The standard solution is async processing: accept a request, queue it, process it in the background, and notify the user when complete. But implementing this well requires thinking through several connected systems.

Architecture Overview

Our stack runs on Cloudflare's edge infrastructure:

  • Next.js 15 with App Router for the frontend
  • Cloudflare Workers for serverless compute
  • Cloudflare D1 for relational data (sessions, messages, user state)
  • Cloudflare Durable Objects for real-time prediction state
  • Cloudflare R2 for image storage
  • Cloudflare Queues for async job processing

The flow looks like this:

  1. User sends a message through the chat interface
  2. Frontend posts to our API as FormData (supports image uploads)
  3. API authenticates, creates or resumes a session, and queues the generation job
  4. Worker picks up the job, calls the AI model API, and streams progress events
  5. Frontend receives real-time updates via polling
  6. When complete, the image uploads to R2 and the user sees the result

This architecture keeps the frontend responsive while handling long-running AI tasks in the background.
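The accept-and-queue step can be sketched as follows. This is a simplified, self-contained model, not our actual handler: the in-memory array stands in for the Cloudflare Queues producer binding, and the ID format is illustrative.

```typescript
// Shape of a job handed to the background worker.
interface GenerationJob {
  predictionId: string;
  sessionId: string;
  prompt: string;
}

// In-memory stand-in for the queue producer binding.
const queue: GenerationJob[] = [];

// Accept the request, enqueue the work, and return immediately with an
// ID the client can poll. This gives 202 Accepted semantics: the HTTP
// response never waits on the AI model.
function handleGenerateRequest(
  sessionId: string,
  prompt: string
): { predictionId: string; status: string } {
  if (!prompt.trim()) {
    throw new Error('Prompt must not be empty');
  }
  const predictionId = `pred_${Math.random().toString(36).slice(2, 10)}`;
  queue.push({ predictionId, sessionId, prompt });
  return { predictionId, status: 'pending' };
}
```

The client then polls a status endpoint with the returned `predictionId` while the worker drains the queue in the background.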


State Management: Sessions and Messages

Chat-based generation differs from single-prompt tools because context matters. A user might say "make it darker," and the system needs to understand what "it" refers to.

We model this with two core entities:

Sessions

A session represents a conversation. It tracks:

  • User ID (if authenticated)
  • Current model selection (users can switch models mid-conversation)
  • Creation timestamp and last activity
  • Soft delete flag for user-initiated cleanup

// Simplified schema (Drizzle ORM)
import { sqliteTable, text, integer } from 'drizzle-orm/sqlite-core';

const chatSessions = sqliteTable('chat_sessions', {
  uuid: text('uuid').primaryKey(),
  userId: text('user_id').notNull(),
  modelId: text('model_id').notNull(),
  createdAt: integer('created_at').notNull(),
  updatedAt: integer('updated_at').notNull(),
  deletedAt: integer('deleted_at'),
});

Messages

Each message belongs to a session and carries:

  • Role: user or assistant
  • Content: text, plus optional media attachments
  • Generation parameters: model used, aspect ratio, resolution
  • Status: pending, processing, completed, failed

The message history provides context for future generations. When a user says "make the background blue," we include the relevant conversation context when calling the AI model.
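A minimal sketch of that context assembly, assuming a message shape like the one described above (the field names and the six-message window are illustrative):

```typescript
// Simplified message shape mirroring the entities described above.
interface Message {
  role: 'user' | 'assistant';
  content: string;
  status: 'pending' | 'processing' | 'completed' | 'failed';
}

// Keep only the most recent completed exchanges, so a follow-up like
// "make the background blue" resolves against the image the user is
// actually talking about, without sending the whole history.
function buildContext(history: Message[], maxMessages = 6): Message[] {
  return history
    .filter((m) => m.status === 'completed')
    .slice(-maxMessages);
}
```

Filtering out failed and in-flight messages matters here: including a failed generation in the context would make the model "refine" an image the user never saw.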


Credit System: Reservations and Confirmations

We use a credit-based pricing model. Different models cost different amounts (Nano Banana costs 5 credits, Nano Banana Pro costs 10-20 credits depending on resolution).

The challenge is handling credits atomically when AI generation can fail at multiple points:

  1. User submits request
  2. We reserve credits (deduct from balance but mark as "pending")
  3. AI generation runs
  4. If successful: confirm the reservation (credits become spent)
  5. If failed: release the reservation (credits return to balance)

This reservation pattern prevents the worst case: charging users for failed generations.

We implement this with Durable Objects, which provide strongly consistent state. Each prediction gets its own Durable Object instance that tracks:

interface PredictionState {
  predictionId: string;
  userId: string;
  creditsReserved: number;
  status: 'pending' | 'processing' | 'succeeded' | 'failed';
  createdAt: number;
  completedAt?: number;
  resultUrl?: string;
  error?: string;
}

The Durable Object handles the state transitions and ensures credits are either confirmed or released exactly once.
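The exactly-once guarantee can be modeled in plain TypeScript. This is an in-memory sketch, not Durable Object code: a real Durable Object serializes access per instance, so the `settled` flag check below stands in for that single-threaded guarantee.

```typescript
type Status = 'pending' | 'processing' | 'succeeded' | 'failed';

// In-memory model of the per-prediction state machine described above.
class PredictionRecord {
  status: Status = 'pending';
  creditsReserved: number;
  private settled = false; // true once credits were confirmed or released

  constructor(creditsReserved: number) {
    this.creditsReserved = creditsReserved;
  }

  // Returns the credit action to take, or null if already settled.
  // Calling this twice (e.g. duplicate webhooks) is a safe no-op.
  settle(outcome: 'succeeded' | 'failed'): 'confirm' | 'release' | null {
    if (this.settled) return null; // exactly-once guard
    this.settled = true;
    this.status = outcome;
    return outcome === 'succeeded' ? 'confirm' : 'release';
  }
}
```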


Streaming Progress: Keeping Users Informed

One advantage of chat-based interfaces is that users expect conversational updates. We leverage this by streaming progress events during generation:

  • understanding: Processing the user's request
  • refining_prompt: Preparing the optimized prompt
  • generating: AI model is creating the image
  • uploading: Saving the result to storage
  • finalizing: Updating session state

The frontend polls a /prediction/[id] endpoint that returns the current state from the Durable Object. This creates a simple but effective progress indicator.

For longer generations (Nano Banana Pro includes a "thinking" stage that analyzes composition before generating), this feedback is essential. Users can see that progress is happening rather than staring at a static loading spinner.
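On the frontend, the stage list above maps naturally onto a progress bar. A small sketch (the user-facing copy is invented for illustration):

```typescript
// Ordered pipeline stages, matching the progress events listed above.
const STAGES = [
  'understanding',
  'refining_prompt',
  'generating',
  'uploading',
  'finalizing',
] as const;
type Stage = (typeof STAGES)[number];

// Illustrative status text shown while each stage runs.
const STAGE_LABELS: Record<Stage, string> = {
  understanding: 'Understanding your request...',
  refining_prompt: 'Refining the prompt...',
  generating: 'Generating your image...',
  uploading: 'Saving the result...',
  finalizing: 'Almost done...',
};

// Progress as a fraction, so the UI renders a moving bar, not a spinner.
function progressFor(stage: Stage): number {
  return (STAGES.indexOf(stage) + 1) / STAGES.length;
}
```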


Model Selection: Flexibility vs. Complexity

We support three AI models with different characteristics:

Model            Speed   Resolution  Cost           Best For
Nano Banana      2-5s    1K          5 credits      Quick drafts
Nano Banana 2    4-6s    Up to 4K    7-14 credits   Balanced work
Nano Banana Pro  10-20s  Up to 4K    10-20 credits  Maximum quality

Users can switch models mid-conversation. This creates flexibility (draft with a fast model, refine with a quality model) but adds complexity to the state management.

We store model selection at the session level, but each message records which model actually produced it. This means a single conversation can include generations from multiple models, and users can see which model produced each result.


Webhook Handling: External Service Integration

We use Replicate as one of our AI model providers. Replicate processes images asynchronously and sends a webhook when complete.

Handling webhooks reliably requires addressing several edge cases:

Idempotency

Webhooks can arrive multiple times. Our handler checks if we've already processed this prediction ID before taking action:

// Simplified handler
export async function POST(request: Request) {
  const { id, status, output } = await request.json();

  // Get the Durable Object for this prediction
  const stub = getPredictionDO(id);
  const state = await stub.getState();

  // Already processed?
  if (state.status !== 'pending' && state.status !== 'processing') {
    return new Response('Already processed', { status: 200 });
  }

  // Update state and confirm or release credits
  if (status === 'succeeded') {
    await stub.markCompleted(output);
    await confirmCredits(state.userId, state.creditsReserved);
  } else {
    await stub.markFailed();
    await releaseCredits(state.userId, state.creditsReserved);
  }

  return new Response('OK', { status: 200 });
}

Timeout Handling

Sometimes webhooks never arrive. We implement a timeout mechanism that checks for stale predictions and either retries or releases reserved credits.
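The sweep itself is simple. A sketch, assuming the prediction shape from earlier; the 5-minute threshold is an illustrative number, not our production value:

```typescript
interface PendingPrediction {
  predictionId: string;
  status: 'pending' | 'processing' | 'succeeded' | 'failed';
  createdAt: number; // epoch ms
}

const STALE_AFTER_MS = 5 * 60 * 1000; // illustrative; tune per model

// Find predictions that have been in flight too long, so their reserved
// credits can be released (or the job retried) by a scheduled task.
function findStale(
  predictions: PendingPrediction[],
  now: number
): PendingPrediction[] {
  return predictions.filter(
    (p) =>
      (p.status === 'pending' || p.status === 'processing') &&
      now - p.createdAt > STALE_AFTER_MS
  );
}
```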

State Recovery

If our service restarts mid-generation, we need to recover. Durable Objects persist their state, so when the service comes back online, we can query for predictions that were in-progress and either resume waiting or initiate recovery.


Cost Optimization: Learning from Production Data

After running this system in production, we've identified several patterns that reduce costs:

Model Routing

Simple prompts often don't need the highest-quality model. We've experimented with automatic model selection based on prompt complexity, though currently we let users choose explicitly. The data suggests we could route 30-40% of requests to cheaper models without noticeable quality impact.
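A routing heuristic along those lines might look like this. To be clear, the word-count threshold and keyword list are invented for illustration; they are not the rules our experiments used.

```typescript
type Model = 'nano-banana' | 'nano-banana-2' | 'nano-banana-pro';

// Hypothetical signals that a prompt wants maximum quality.
const DETAIL_KEYWORDS = ['photorealistic', '4k', 'intricate', 'detailed', 'cinematic'];

// Route simple prompts to the cheap model, detailed ones upward.
function routeModel(prompt: string): Model {
  const words = prompt.trim().split(/\s+/).length;
  const wantsDetail = DETAIL_KEYWORDS.some((k) =>
    prompt.toLowerCase().includes(k)
  );
  if (wantsDetail) return 'nano-banana-pro';
  if (words > 25) return 'nano-banana-2';
  return 'nano-banana';
}
```

The appeal of a heuristic like this is that it fails safe: a misrouted prompt still produces an image, just at a different quality/cost point, and the user can always override the choice.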

Resolution Defaults

Most users don't need 4K output. We default to 2K, which costs fewer credits and generates faster. Users can upgrade resolution for specific images where it matters.

Caching Similar Requests

Users often regenerate with minor variations ("same but with different text"). We've explored caching intermediate representations, though the complexity hasn't been worth it for our scale yet.


What We'd Do Differently

Building this system taught us a few lessons:

Polling vs. WebSockets

We chose polling for simplicity. It works adequately, but for real-time collaboration features, WebSockets would provide a better experience. The infrastructure complexity is the trade-off.

Queue Visibility

Cloudflare Queues are reliable but debugging failed jobs can be challenging. We added extensive logging to the queue consumer, which helps but isn't as nice as a proper dead-letter queue interface.

Rate Limiting at the Edge

We implement rate limiting with Durable Objects, which works but adds latency to every request. A dedicated rate-limiting service at the edge would be cleaner.
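For reference, the core of a per-user limiter like the one a Durable Object might hold is small. This is a fixed-window sketch with illustrative limits, not our production implementation:

```typescript
// Fixed-window rate limiter; one instance per user.
class RateLimiter {
  private windowStart = 0;
  private count = 0;

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the request is allowed in the current window.
  allow(now: number): boolean {
    if (now - this.windowStart >= this.windowMs) {
      // New window: reset the counter.
      this.windowStart = now;
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }
}
```

The latency cost mentioned above comes from the round trip to the Durable Object on every request, not from this logic, which is trivial.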


Conclusion

Chat-based AI image generation requires thinking beyond the AI model itself. The surrounding architecture (async processing, state management, credit handling, progress streaming) determines whether the user experience feels smooth or frustrating.

Our approach prioritizes:

  1. Responsiveness: The UI never blocks on long-running AI tasks
  2. Transparency: Users see progress and understand what's happening
  3. Reliability: Credits are handled atomically; failures don't charge users
  4. Flexibility: Users can switch models and refine through conversation

This architecture has handled thousands of generations with consistent uptime. You can see it in action at Banana AI.

If you're building similar systems, I'm interested in hearing about your architectural decisions. What trade-offs have you made between simplicity and features? How do you handle the async nature of AI generation?


Technical Stack Summary

  • Framework: Next.js 15 with App Router
  • Runtime: Cloudflare Workers via OpenNext
  • Database: Cloudflare D1 (SQLite at the edge)
  • State: Cloudflare Durable Objects
  • Storage: Cloudflare R2
  • Queue: Cloudflare Queues
  • AI Providers: Google Gemini models via Replicate
  • Authentication: NextAuth v5 with Google OAuth

Tags: #javascript #typescript #ai #cloudflare #architecture #webdev
