<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: lsm166</title>
    <description>The latest articles on DEV Community by lsm166 (@166_73a70b7af425e036b1).</description>
    <link>https://dev.to/166_73a70b7af425e036b1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3910331%2Fa72171af-e9ac-44fa-9213-68a0ea81f5b4.png</url>
      <title>DEV Community: lsm166</title>
      <link>https://dev.to/166_73a70b7af425e036b1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/166_73a70b7af425e036b1"/>
    <language>en</language>
    <item>
      <title>I Built a Multi-Model AI Image &amp; Video Platform — Here's What I Learned</title>
      <dc:creator>lsm166</dc:creator>
      <pubDate>Sun, 03 May 2026 13:01:49 +0000</pubDate>
      <link>https://dev.to/166_73a70b7af425e036b1/i-built-a-multi-model-ai-image-video-platform-heres-what-i-learned-5d0j</link>
      <guid>https://dev.to/166_73a70b7af425e036b1/i-built-a-multi-model-ai-image-video-platform-heres-what-i-learned-5d0j</guid>
      <description>&lt;p&gt;A few months ago I started building &lt;a href="https://bananai.io/" rel="noopener noreferrer"&gt;Bananai&lt;/a&gt; — a platform that&lt;br&gt;
lets you generate images, edit photos, and create AI videos all in one place without&lt;br&gt;
switching between five different tools and five different billing dashboards.&lt;/p&gt;

&lt;p&gt;Here's what I actually learned integrating 10+ models end-to-end as an indie dev.&lt;/p&gt;


&lt;h2&gt;Why I built it&lt;/h2&gt;

&lt;p&gt;My problem was embarrassingly mundane: I needed to test GPT Image 2 against Nano Banana&lt;br&gt;
Pro for an e-commerce client's product shots. I had &lt;strong&gt;four browser tabs open&lt;/strong&gt;, four&lt;br&gt;
different logins, four different credit top-ups, and I was copy-pasting the same prompt&lt;br&gt;
over and over.&lt;/p&gt;

&lt;p&gt;The obvious solution was to wrap them in a unified UI. What I thought would take a&lt;br&gt;
weekend ended up being a real product — because model integration is the easy part.&lt;br&gt;
Everything else is the hard part.&lt;/p&gt;


&lt;h2&gt;1. Picking the right model for the task is not obvious&lt;/h2&gt;

&lt;p&gt;The first mistake I made was assuming the "best" model produces the best output for&lt;br&gt;
every use case. It doesn't. Here's how I actually think about it now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Model choice&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast social content iteration&lt;/td&gt;
&lt;td&gt;Nano Banana 2&lt;/td&gt;
&lt;td&gt;2–5s per image, good enough quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Client deliverables / 4K output&lt;/td&gt;
&lt;td&gt;Nano Banana Pro&lt;/td&gt;
&lt;td&gt;Max quality, character consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text in images (logos, UI mocks)&lt;/td&gt;
&lt;td&gt;GPT Image 2&lt;/td&gt;
&lt;td&gt;Best text rendering accuracy by far&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Animate a still product photo&lt;/td&gt;
&lt;td&gt;Seedance / Veo&lt;/td&gt;
&lt;td&gt;Different motion styles, test both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background removal + style transfer&lt;/td&gt;
&lt;td&gt;Nano Banana (editing mode)&lt;/td&gt;
&lt;td&gt;Natural language edit instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The practical takeaway: &lt;strong&gt;expose model selection to the user early.&lt;/strong&gt; Users figure out&lt;br&gt;
their own preferences faster than any default you set. I wasted a month pre-selecting&lt;br&gt;
defaults before realizing users just want the picker.&lt;/p&gt;
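
&lt;p&gt;For what it's worth, here's a rough sketch of how that task-to-default mapping can&lt;br&gt;
look. The task keys and model IDs below are illustrative, not Bananai's actual config;&lt;br&gt;
the point is that the user's explicit pick always wins over the default.&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative only: task keys and model IDs are made up for this post.
const defaultModelForTask = {
  "social-iteration":   "nano-banana-2",    // fast, cheap, good enough
  "client-deliverable": "nano-banana-pro",  // max quality, 4K, consistent characters
  "text-in-image":      "gpt-image-2",      // best text rendering
  "animate-still":      "seedance",         // or "veo", worth testing both
  "edit-background":    "nano-banana-edit"  // natural-language edits
};

// The user-facing picker always overrides the default.
function resolveModel(task, userChoice) {
  return userChoice || defaultModelForTask[task] || "nano-banana-2";
}
&lt;/code&gt;&lt;/pre&gt;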


&lt;h2&gt;2. The cost structure is wilder than you'd expect&lt;/h2&gt;

&lt;p&gt;Before building this I assumed model pricing was roughly proportional to quality. It's not.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT Image 2&lt;/strong&gt; — ~$0.006/image at standard quality. Shockingly cheap for what it outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nano Banana 2&lt;/strong&gt; — fast and economical, great for high-volume generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video models&lt;/strong&gt; — anywhere from 10x to 50x the cost of an image per second of output.
Budget for this separately. Video is a different product category economically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing that trips up most builders: &lt;strong&gt;you need to model your credit burn rate for&lt;br&gt;
different user behavior patterns, not averages.&lt;/strong&gt; A power user generating 4K video clips&lt;br&gt;
will burn through credits 30x faster than someone doing quick image edits. If you price&lt;br&gt;
on averages, power users kill your margin.&lt;/p&gt;

&lt;p&gt;My approach: track generation cost per user session, flag sessions above the 95th&lt;br&gt;
percentile, and cap or up-sell those users. Works better than blanket rate limits.&lt;/p&gt;
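
&lt;p&gt;A minimal sketch of what that tracking can look like, assuming you log a cost per&lt;br&gt;
generation somewhere (the helpers below are hypothetical, not an existing library):&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Rough sketch, not production code: accumulate spend per session,
// then flag anything at or above the 95th percentile for a cap or an up-sell prompt.
const sessionCosts = new Map(); // sessionId to cumulative cost in USD

function recordGeneration(sessionId, costUsd) {
  const current = sessionCosts.get(sessionId) || 0;
  sessionCosts.set(sessionId, current + costUsd);
}

function heavySessions() {
  const costs = [...sessionCosts.values()].sort(function (a, b) { return a - b; });
  const p95 = costs[Math.floor(costs.length * 0.95)] || Infinity;
  const flagged = [];
  for (const [sessionId, cost] of sessionCosts) {
    if (cost &amp;gt;= p95) flagged.push(sessionId);
  }
  return flagged;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In practice this runs over persisted usage events rather than an in-memory map, but&lt;br&gt;
the shape of the check is the same.&lt;/p&gt;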


&lt;h2&gt;3. Async generation UX is harder than sync&lt;/h2&gt;

&lt;p&gt;Most image models are fast enough to feel synchronous (~3–5 seconds). But video generation&lt;br&gt;
can take 20–60 seconds depending on model and duration. That's a different UX contract.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What didn't work:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spinner with no feedback → users thought it crashed and refreshed&lt;/li&gt;
&lt;li&gt;Showing "estimated time" → estimates were wrong enough to frustrate people more than no
estimate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What actually worked:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Progress bar that moves in two phases: "queued → processing" with distinct visual states&lt;/li&gt;
&lt;li&gt;Showing a low-res preview/thumbnail as soon as it's available while full resolution
renders&lt;/li&gt;
&lt;li&gt;Email/notification when a long video finishes (reduces page-staring behavior)&lt;/li&gt;
&lt;/ul&gt;
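
&lt;p&gt;The two-phase progress bar boils down to polling a job status and treating "queued"&lt;br&gt;
and "processing" as distinct states. A minimal sketch; the endpoint path and response&lt;br&gt;
shape here are assumptions for illustration, not Bananai's actual API:&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Assumes a hypothetical GET /api/jobs/:id that returns
// { status: "queued" | "processing" | "done" | "failed", previewUrl, resultUrl }.
async function pollVideoJob(jobId, onUpdate) {
  while (true) {
    const res = await fetch("/api/jobs/" + jobId);
    const job = await res.json();

    // Two distinct visual states; previewUrl may carry a low-res thumbnail early.
    onUpdate(job.status, job.previewUrl);

    if (job.status === "done" || job.status === "failed") return job;
    await new Promise(function (resolve) { setTimeout(resolve, 2000); });
  }
}
&lt;/code&gt;&lt;/pre&gt;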

&lt;p&gt;For image generation, the UX expectation is basically instant. Anything over 8 seconds&lt;br&gt;
feels broken to users, even if technically the model is just slow. Cache aggressively,&lt;br&gt;
pre-warm where possible.&lt;/p&gt;


&lt;h2&gt;4. Prompt UX is underrated&lt;/h2&gt;

&lt;p&gt;Most AI image tools show you a blank text box. That's terrible for conversion and&lt;br&gt;
terrible for retention — new users don't know what to type and leave.&lt;/p&gt;

&lt;p&gt;What I added instead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;promptSuggestions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🎨 Product Hero Shot&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🖼️ Remove Background&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🎬 Cinematic Scene&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;✨ Style Transfer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't just UI sugar. Each one loads a &lt;strong&gt;pre-filled prompt template with the&lt;br&gt;
right model pre-selected and the right settings pre-configured.&lt;/strong&gt; Click "Product Hero&lt;br&gt;
Shot" and you get an image-to-image flow with the right aspect ratio for e-commerce, not&lt;br&gt;
a blank canvas.&lt;/p&gt;
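
&lt;p&gt;The templates themselves are nothing fancy. A hypothetical shape (field names and&lt;br&gt;
prompts here are illustrative, not the exact ones in production):&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative template shape; clicking a suggestion pre-fills the whole form.
const templates = {
  "🎨 Product Hero Shot": {
    model: "nano-banana-pro",
    mode: "image-to-image",
    aspectRatio: "1:1", // square fits most e-commerce listings
    prompt: "Studio product shot of {subject} on a seamless background, soft shadows"
  },
  "🖼️ Remove Background": {
    model: "nano-banana-edit",
    mode: "edit",
    prompt: "Remove the background and keep only the main subject"
  }
};

function applySuggestion(label) {
  return templates[label]; // the form starts filled in instead of blank
}
&lt;/code&gt;&lt;/pre&gt;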

&lt;p&gt;Conversion from landing → first generation went up significantly after this. Users who&lt;br&gt;
generate at least one image on their first visit retain at 3x the rate of those who&lt;br&gt;
don't.&lt;/p&gt;




&lt;h2&gt;5. Multi-model output comparison is the killer feature nobody talks about&lt;/h2&gt;

&lt;p&gt;The most popular session pattern I see in analytics: user generates with one model, then&lt;br&gt;
immediately tries the same prompt with another model to compare. This is the workflow&lt;br&gt;
professional designers actually use — they're not loyal to a model, they want to see options.&lt;/p&gt;

&lt;p&gt;Building side-by-side comparison mode is on the roadmap. If you're building a similar&lt;br&gt;
tool: &lt;strong&gt;this is worth prioritizing early.&lt;/strong&gt; It's also the stickiest feature because it&lt;br&gt;
locks users into your platform (they can't compare across separate tools with one click).&lt;/p&gt;
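
&lt;p&gt;The fan-out itself is simple. Here's roughly how I expect the comparison flow to work;&lt;br&gt;
&lt;code&gt;generateImage()&lt;/code&gt; is a stand-in for a per-model client wrapper, not an existing&lt;br&gt;
Bananai API:&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Run the same prompt against several models in parallel and keep failures visible,
// so users can see which model choked on a given prompt.
async function compareModels(prompt, models) {
  const results = await Promise.allSettled(
    models.map(function (model) { return generateImage({ model: model, prompt: prompt }); })
  );
  return results.map(function (result, i) {
    return { model: models[i], ...result }; // { model, status, value or reason }
  });
}

// compareModels("a ceramic mug on a marble counter", ["nano-banana-2", "gpt-image-2"]);
&lt;/code&gt;&lt;/pre&gt;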




&lt;h2&gt;What's live now&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://bananai.io/" rel="noopener noreferrer"&gt;Bananai&lt;/a&gt; currently has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image generation&lt;/strong&gt;: Nano Banana 2 &amp;amp; Pro, GPT Image 2, Grok Imagine, Midjourney,
Seedream, Wan 2.7&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image editing&lt;/strong&gt;: background removal, style transfer, inpainting, upscaling — all
via natural language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video generation&lt;/strong&gt;: Veo 3.1, Seedance 2.0, Wan 2.7 Video, Grok Imagine Video&lt;/li&gt;
&lt;li&gt;Free credits on sign-up, daily check-in credits, no credit card required to start&lt;/li&gt;
&lt;li&gt;GPT Image 2 at &lt;strong&gt;$0.006/image&lt;/strong&gt; — cheapest I've found anywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building something with AI image generation or just want to test models&lt;br&gt;
without juggling multiple accounts, &lt;a href="https://bananai.io/" rel="noopener noreferrer"&gt;give it a try&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;What I'd do differently&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with one model, nail the UX, then add models.&lt;/strong&gt; I added too many too fast and
spread the QA effort too thin.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrument cost tracking from day one.&lt;/strong&gt; I retrofitted it and lost two weeks of data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't underestimate video.&lt;/strong&gt; It looks like "image but animated" but it's actually
a completely different infrastructure, moderation, and pricing problem.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Happy to answer questions about any of this — model integration, credit system design,&lt;br&gt;
or the UX decisions. Drop them in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Next.js, deployed on Vercel, with Cloudflare for CDN/image resizing.&lt;br&gt;
The backend is a monolith I'm slowly becoming ashamed of.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
