When we started building Banana AI, we faced a familiar challenge. Users wanted to generate images by describing what they needed, but most existing AI image tools required learning prompt syntax, navigating complex interfaces, or tolerating long waits with no feedback.
The solution we built centers on a chat-based interface where users describe their needs in plain language, refine through conversation, and receive production-ready images. This approach works well for our users, but getting there required solving several architectural problems around async processing, state management, and cost optimization.
This article walks through the practical decisions we made building the AI image generator and the patterns that have worked in production.
The Core Problem: AI Generation Is Slow
AI image generation takes anywhere from 2 to 20 seconds depending on the model and resolution. That's too long for a synchronous HTTP request. Users expect responsive interfaces, and holding a connection open for 20 seconds creates poor experiences and infrastructure problems.
The standard solution is async processing: accept a request, queue it, process it in the background, and notify the user when complete. But implementing this well requires thinking through several connected systems.
Architecture Overview
Our stack runs on Cloudflare's edge infrastructure:
- Next.js 15 with App Router for the frontend
- Cloudflare Workers for serverless compute
- Cloudflare D1 for relational data (sessions, messages, user state)
- Cloudflare Durable Objects for real-time prediction state
- Cloudflare R2 for image storage
- Cloudflare Queues for async job processing
The flow looks like this:
- User sends a message through the chat interface
- Frontend posts to our API as FormData (supports image uploads)
- API authenticates, creates or resumes a session, and queues the generation job
- Worker picks up the job, calls the AI model API, and streams progress events
- Frontend receives real-time updates via polling
- When complete, the image uploads to R2 and the user sees the result
This architecture keeps the frontend responsive while handling long-running AI tasks in the background.
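The flow above can be sketched with an in-memory queue standing in for Cloudflare Queues. The names here (`GenerationJob`, `acceptRequest`, `drainQueue`) are illustrative, not our actual API:

```typescript
// Illustrative types; the real system uses Cloudflare Queues and Workers.
interface GenerationJob {
  predictionId: string;
  sessionId: string;
  prompt: string;
}

type JobStatus = 'pending' | 'processing' | 'completed';

const statuses = new Map<string, JobStatus>();
const queue: GenerationJob[] = [];

// API handler: record pending state, enqueue, and return immediately.
function acceptRequest(job: GenerationJob): { predictionId: string } {
  statuses.set(job.predictionId, 'pending');
  queue.push(job);
  return { predictionId: job.predictionId }; // client polls with this ID
}

// Queue consumer: the slow AI call happens here, in the background.
async function drainQueue(generate: (prompt: string) => Promise<string>) {
  while (queue.length > 0) {
    const job = queue.shift()!;
    statuses.set(job.predictionId, 'processing');
    await generate(job.prompt); // the 2-20 second model call
    statuses.set(job.predictionId, 'completed');
  }
}
```

The key property is that `acceptRequest` returns in milliseconds regardless of how long `generate` takes.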
State Management: Sessions and Messages
Chat-based generation differs from single-prompt tools because context matters. A user might say "make it darker," and the system needs to understand what "it" refers to.
We model this with two core entities:
Sessions
A session represents a conversation. It tracks:
- User ID (if authenticated)
- Current model selection (users can switch models mid-conversation)
- Creation timestamp and last activity
- Soft delete flag for user-initiated cleanup
```typescript
// Simplified schema
const chatSessions = sqliteTable('chat_sessions', {
  uuid: text('uuid').primaryKey(),
  userId: text('user_id').notNull(),
  modelId: text('model_id').notNull(),
  createdAt: integer('created_at').notNull(),
  updatedAt: integer('updated_at').notNull(),
  deletedAt: integer('deleted_at'),
});
```
Messages
Each message belongs to a session and carries:
- Role: `user` or `assistant`
- Content: text, plus optional media attachments
- Generation parameters: model used, aspect ratio, resolution
- Status: `pending`, `processing`, `completed`, or `failed`
The message history provides context for future generations. When a user says "make the background blue," we include the relevant conversation context when calling the AI model.
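A companion messages table might look like the following sketch, mirroring the sessions schema above. The column names are illustrative, not our actual schema:

```typescript
// Illustrative companion schema; real column names may differ.
const chatMessages = sqliteTable('chat_messages', {
  uuid: text('uuid').primaryKey(),
  sessionUuid: text('session_uuid').notNull(),
  role: text('role').notNull(),          // 'user' | 'assistant'
  content: text('content').notNull(),
  mediaUrl: text('media_url'),           // optional attachment stored in R2
  modelId: text('model_id'),             // model that produced this message
  status: text('status').notNull(),      // pending | processing | completed | failed
  createdAt: integer('created_at').notNull(),
});
```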
Credit System: Reservations and Confirmations
We use a credit-based pricing model. Different models cost different amounts (Nano Banana costs 5 credits, Nano Banana Pro costs 10-20 credits depending on resolution).
The challenge is handling credits atomically when AI generation can fail at multiple points:
- User submits request
- We reserve credits (deduct from balance but mark as "pending")
- AI generation runs
- If successful: confirm the reservation (credits become spent)
- If failed: release the reservation (credits return to balance)
This reservation pattern prevents the worst case: charging users for failed generations.
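The reserve/confirm/release cycle reduces to a small state machine. This is a minimal in-memory sketch; our real balances live in D1 and Durable Objects:

```typescript
// Minimal sketch of the credit reservation pattern.
interface Account {
  balance: number;  // credits available to spend
  reserved: number; // credits held by in-flight generations
}

function reserve(acct: Account, amount: number): boolean {
  if (acct.balance < amount) return false; // insufficient credits
  acct.balance -= amount;
  acct.reserved += amount;
  return true;
}

function confirm(acct: Account, amount: number): void {
  acct.reserved -= amount; // generation succeeded: credits are spent
}

function release(acct: Account, amount: number): void {
  acct.reserved -= amount;
  acct.balance += amount; // generation failed: credits return to balance
}
```

At every point, `balance + reserved` only decreases when a generation actually succeeds.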
We implement this with Durable Objects, which provide strongly consistent state. Each prediction gets its own Durable Object instance that tracks:
```typescript
interface PredictionState {
  predictionId: string;
  userId: string;
  creditsReserved: number;
  status: 'pending' | 'processing' | 'succeeded' | 'failed';
  createdAt: number;
  completedAt?: number;
  resultUrl?: string;
  error?: string;
}
```
The Durable Object handles the state transitions and ensures credits are either confirmed or released exactly once.
Streaming Progress: Keeping Users Informed
One advantage of chat-based interfaces is that users expect conversational updates. We leverage this by streaming progress events during generation:
- `understanding`: Processing the user's request
- `refining_prompt`: Preparing the optimized prompt
- `generating`: AI model is creating the image
- `uploading`: Saving the result to storage
- `finalizing`: Updating session state
The frontend polls a /prediction/[id] endpoint that returns the current state from the Durable Object. This creates a simple but effective progress indicator.
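The client-side loop can be sketched as follows. The `fetchState` parameter is injected here for illustration; in practice it would wrap a `fetch` to the prediction endpoint:

```typescript
// Hypothetical polling helper; phase names match our progress events.
type Phase = 'understanding' | 'refining_prompt' | 'generating'
  | 'uploading' | 'finalizing' | 'succeeded' | 'failed';

async function pollPrediction(
  fetchState: (id: string) => Promise<{ status: Phase; resultUrl?: string }>,
  id: string,
  onProgress: (phase: Phase) => void,
  intervalMs = 1000,
): Promise<string | null> {
  for (;;) {
    const state = await fetchState(id);
    onProgress(state.status); // drive the progress indicator
    if (state.status === 'succeeded') return state.resultUrl ?? null;
    if (state.status === 'failed') return null;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```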
For longer generations (Nano Banana Pro includes a "thinking" stage that analyzes composition before generating), this feedback is essential. Users can see that progress is happening rather than staring at a static loading spinner.
Model Selection: Flexibility vs. Complexity
We support three AI models with different characteristics:
| Model | Speed | Resolution | Cost | Best For |
|---|---|---|---|---|
| Nano Banana | 2-5s | 1K | 5 credits | Quick drafts |
| Nano Banana 2 | 4-6s | Up to 4K | 7-14 credits | Balanced work |
| Nano Banana Pro | 10-20s | Up to 4K | 10-20 credits | Maximum quality |
Users can switch models mid-conversation. This creates flexibility (draft with a fast model, refine with a quality model) but adds complexity to the state management.
We store model selection at the session level, but each message records which model actually produced it. This means a single conversation can include generations from multiple models, and users can see which model produced each result.
Webhook Handling: External Service Integration
We use Replicate as one of our AI model providers. Replicate processes images asynchronously and sends a webhook when complete.
Handling webhooks reliably requires addressing several edge cases:
Idempotency
Webhooks can arrive multiple times. Our handler checks if we've already processed this prediction ID before taking action:
```typescript
// Simplified handler
export async function POST(request: Request) {
  const { id, status, output } = await request.json();

  // Get the Durable Object for this prediction
  const stub = getPredictionDO(id);
  const state = await stub.getState();

  // Already processed?
  if (state.status !== 'pending' && state.status !== 'processing') {
    return new Response('Already processed', { status: 200 });
  }

  // Update state and confirm or release credits
  if (status === 'succeeded') {
    await stub.markCompleted(output);
    await confirmCredits(state.userId, state.creditsReserved);
  } else {
    await stub.markFailed();
    await releaseCredits(state.userId, state.creditsReserved);
  }

  return new Response('OK', { status: 200 });
}
```
Timeout Handling
Sometimes webhooks never arrive. We implement a timeout mechanism that checks for stale predictions and either retries or releases reserved credits.
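The sweep itself is a filter over prediction state. The five-minute threshold below is illustrative, not our production value:

```typescript
// Sketch of a stale-prediction sweep; threshold is an assumed example.
const TIMEOUT_MS = 5 * 60 * 1000;

interface Pending {
  predictionId: string;
  status: string;
  createdAt: number;
  creditsReserved: number;
}

// Find predictions that never received a webhook within the timeout.
function findStale(predictions: Pending[], now: number): Pending[] {
  return predictions.filter(
    (p) =>
      (p.status === 'pending' || p.status === 'processing') &&
      now - p.createdAt > TIMEOUT_MS,
  );
}
```

Each prediction returned by the sweep is then marked failed and its reserved credits released, following the same path as a failure webhook.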
State Recovery
If our service restarts mid-generation, we need to recover. Durable Objects persist their state, so when the service comes back online, we can query for predictions that were in-progress and either resume waiting or initiate recovery.
Cost Optimization: Learning from Production Data
After running this system in production, we've identified several patterns that reduce costs:
Model Routing
Simple prompts often don't need the highest-quality model. We've experimented with automatic model selection based on prompt complexity, though currently we let users choose explicitly. The data suggests we could route 30-40% of requests to cheaper models without noticeable quality impact.
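A routing heuristic of this kind could be as simple as the sketch below. This is hypothetical; the product currently lets users choose the model explicitly:

```typescript
// Hypothetical complexity heuristic for automatic model routing.
function routeModel(prompt: string): 'nano-banana' | 'nano-banana-pro' {
  const words = prompt.trim().split(/\s+/).length;
  // Long prompts or explicit quality cues suggest the higher-end model.
  const wantsDetail = /photorealistic|4k|intricate|detailed/i.test(prompt);
  return words > 25 || wantsDetail ? 'nano-banana-pro' : 'nano-banana';
}
```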
Resolution Defaults
Most users don't need 4K output. We default to 2K, which costs fewer credits and generates faster. Users can upgrade resolution for specific images where it matters.
Caching Similar Requests
Users often regenerate with minor variations ("same but with different text"). We've explored caching intermediate representations, though the complexity hasn't been worth it for our scale yet.
What We'd Do Differently
Building this system taught us a few lessons:
Polling vs. WebSockets
We chose polling for simplicity. It works adequately, but for real-time collaboration features, WebSockets would provide a better experience. The infrastructure complexity is the trade-off.
Queue Visibility
Cloudflare Queues are reliable but debugging failed jobs can be challenging. We added extensive logging to the queue consumer, which helps but isn't as nice as a proper dead-letter queue interface.
Rate Limiting at the Edge
We implement rate limiting with Durable Objects, which works but adds latency to every request. A dedicated rate-limiting service at the edge would be cleaner.
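The core of a Durable Object rate limiter is a token bucket over persisted state. Stripped of the Durable Object plumbing, the logic looks like this sketch:

```typescript
// Minimal token-bucket limiter; in production this state lives in a
// Durable Object so it is strongly consistent per user.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(
    private capacity: number,      // burst size
    private refillPerSec: number,  // sustained rate
    now: number,
  ) {
    this.tokens = capacity;
    this.last = now;
  }

  allow(now: number): boolean {
    const elapsedSec = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```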
Conclusion
Chat-based AI image generation requires thinking beyond the AI model itself. The surrounding architecture (async processing, state management, credit handling, progress streaming) determines whether the user experience feels smooth or frustrating.
Our approach prioritizes:
- Responsiveness: The UI never blocks on long-running AI tasks
- Transparency: Users see progress and understand what's happening
- Reliability: Credits are handled atomically; failures don't charge users
- Flexibility: Users can switch models and refine through conversation
This architecture has handled thousands of generations with consistent uptime. You can see it in action at Banana AI.
If you're building similar systems, I'm interested in hearing about your architectural decisions. What trade-offs have you made between simplicity and features? How do you handle the async nature of AI generation?
Technical Stack Summary
- Framework: Next.js 15 with App Router
- Runtime: Cloudflare Workers via OpenNext
- Database: Cloudflare D1 (SQLite at the edge)
- State: Cloudflare Durable Objects
- Storage: Cloudflare R2
- Queue: Cloudflare Queues
- AI Providers: Google Gemini models via Replicate
- Authentication: NextAuth v5 with Google OAuth
Tags: #javascript #typescript #ai #cloudflare #architecture #webdev