Last year, we started building an object-based image editor.
The idea was ambitious: let users work with images as layered compositions instead of flat outputs. Move objects around, replace parts of a scene, generate edits with actual control. Something closer to creative work than typing a prompt and crossing your fingers.
Sounded hard, but we had a few promising ideas.
What we didn't expect: the editor wasn't the hard part. Everything underneath it was.
Once we got deeper in, we realized building on top of image models is way messier than it looks from the outside. Not because the models are bad — because the infrastructure around them is fragmented in ways that hurt the moment you try to ship something real.
"Just send a prompt, get an image back". Sure, in a demo.
In practice, one workflow might chain four or five different steps. Text-to-image generation. Inpainting. Object segmentation. Upscaling. Maybe a style transfer pass. And the moment you pick the best model for each step, you're talking to three different providers with three different APIs.
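A workflow like that is, at its core, just a chain of async steps. A minimal sketch (the step names here are hypothetical, not our actual code):

```typescript
// Each workflow step takes image bytes and returns image bytes,
// behind one uniform async signature.
type Step = (image: Uint8Array) => Promise<Uint8Array>;

// Compose steps left-to-right into a single pipeline.
const pipeline = (...steps: Step[]): Step =>
  async (input) => {
    let current = input;
    for (const step of steps) {
      current = await step(current);
    }
    return current;
  };

// Stand-in steps for illustration; each appends a marker byte so the
// order is visible. Real steps would each call a different provider.
const tag = (label: string): Step =>
  async (img) => new Uint8Array([...img, label.charCodeAt(0)]);

// inpaint -> segment -> upscale, as one callable unit
const edit = pipeline(tag("i"), tag("s"), tag("u"));
```

The composition itself is trivial. The problem is that each real step hides a different provider with its own request shape, response shape, and failure modes.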
That's where it stops being fun.
Replicate returns prediction objects you poll every two seconds. fal.ai gives you a webhook or a queue-based response. OpenAI returns base64 or URLs depending on what you ask for. Some providers want width and height, others want aspect_ratio, others want an enum of preset sizes. One provider rounds megapixels up for billing (thanks, BFL). Another bills per image. Error shapes are different. Timeout behavior is different. Rate limit headers are different.
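Just hiding the poll-based providers takes real code. A sketch of what normalizing Replicate-style polling into a single awaited call looks like (names and shapes are illustrative, not Replicate's actual client):

```typescript
// Simplified prediction shape, modeled loosely on poll-based providers.
interface Prediction {
  status: "starting" | "processing" | "succeeded" | "failed";
  output?: string; // image URL once succeeded
  error?: string;
}

// Wrap the polling loop so callers just await one promise.
async function awaitPrediction(
  getStatus: () => Promise<Prediction>,
  intervalMs = 2000,
  maxAttempts = 60,
): Promise<string> {
  for (let i = 0; i < maxAttempts; i++) {
    const p = await getStatus();
    if (p.status === "succeeded") return p.output!;
    if (p.status === "failed") throw new Error(p.error ?? "prediction failed");
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("timed out waiting for prediction");
}
```

Multiply that by every provider quirk in the list above and the glue code piles up fast.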
We wrote adapters. Then more adapters. Our actual product logic was slowly drowning in provider-specific glue code.
I think people underestimate this about AI image products. The difficulty isn't model quality (well, not only); it's integration complexity.
You don't notice it in a weekend prototype. You notice it when you need fallbacks. When a provider goes down at 2am and you're scrambling to reroute traffic. When you want to A/B test two models and realize your code assumes one specific provider's response format everywhere. When you look at your codebase and half of it is just translating between API dialects. Even with LLMs helping write the glue, it gets cumbersome at some point.
I remember thinking: why is there no OpenRouter-style layer for image models?
With text models, developers already expect abstraction. You pick a model, call a unified API, swap providers without rewriting anything. OpenRouter, LiteLLM... that idea is well-established. Nobody builds a new HTTP client for every LLM provider anymore.
But in image generation? We couldn't find anything like it. There are providers with a good selection, but none of them had all the models we needed, and a single provider won't solve resilience once your product is in production and people rely on it. Which felt backwards, because image workflows arguably need this abstraction more than text does.
So we built the thing we wished existed
That's how Lumenfall started. Not as a grand plan to build an AI gateway. More like: we kept solving the same infrastructure problem over and over while trying to build our actual product, and eventually it became obvious that this should be its own layer.
The gateway runs as a Cloudflare Worker — Hono for the HTTP framework, Zod for request validation, R2 for media storage. We went with an OpenAI-compatible API surface (/v1/images/generations, /v1/images/edits) so you can point existing SDKs at it and things just work.
Under the hood, every request goes through a normalization step. Provider X wants image_size as a union type. Provider Y wants aspect_ratio as a string. Provider Z wants width and height as integers. We handle all of that in typed adapters so your request always looks the same:
curl https://api.lumenfall.ai/openai/v1/images/generations \
  -H "Authorization: Bearer lum_xxx" \
  -d '{
    "model": "flux-schnell",
    "prompt": "A sunset over mountains",
    "size": "16:9",
    "output_format": "webp"
  }'
Same shape whether it hits fal, Replicate, Vertex AI, or any of the providers we support. The response comes back with cost metadata, provider info, and timing.
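To make the size-parameter divergence concrete, here's a sketch of what an adapter layer like this can look like. The per-provider field names and the rounding rule are illustrative assumptions, not any provider's real API:

```typescript
// One unified request, translated into each provider's dialect.
interface UnifiedRequest {
  model: string;
  prompt: string;
  size: string; // aspect ratio like "16:9"
}

// Map an aspect ratio to pixel dimensions, assuming a ~1 megapixel
// target and rounding to multiples of 64 (a common model constraint).
function toDimensions(ratio: string): { width: number; height: number } {
  const [w, h] = ratio.split(":").map(Number);
  const scale = Math.sqrt((1024 * 1024) / (w * h));
  const round64 = (n: number) => Math.round((n * scale) / 64) * 64;
  return { width: round64(w), height: round64(h) };
}

// Adapters producing each dialect from the same unified request.
const adapters = {
  wantsAspectRatio: (r: UnifiedRequest) =>
    ({ prompt: r.prompt, aspect_ratio: r.size }),
  wantsDimensions: (r: UnifiedRequest) =>
    ({ prompt: r.prompt, ...toDimensions(r.size) }),
};
```

The point isn't the math; it's that every one of these translations lives in one typed adapter instead of being smeared across product code.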
We also added priority-based routing with automatic failover. If the primary provider times out or returns a 503, the request rolls to the next one. Rate-limited? It goes into a retry queue with a short delay instead of failing outright. Auth errors or bad requests don't trigger failover — no point retrying those elsewhere.
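The core of that routing logic fits in a few lines. A simplified sketch under the classification described above (provider calls are abstracted to a generic function; the outcome categories are illustrative):

```typescript
// Classified result of one provider call.
type Outcome =
  | { kind: "ok"; body: string }
  | { kind: "retryable" }               // timeout, 503, ...
  | { kind: "rate_limited" }            // 429 -> delayed retry queue upstream
  | { kind: "fatal"; status: number };  // 401, 400, ... no point retrying

// Try providers in priority order. Retryable failures roll to the
// next provider; everything else is returned to the caller.
async function routeWithFailover(
  providers: Array<() => Promise<Outcome>>,
): Promise<Outcome> {
  for (const call of providers) {
    const result = await call();
    if (result.kind === "retryable") continue; // fail over to next
    return result; // ok, rate_limited (queued upstream), or fatal
  }
  return { kind: "fatal", status: 503 }; // every provider exhausted
}
```

The important design choice is in the classification: a 503 and a 401 both look like "an error" to naive code, but only one of them is worth sending somewhere else.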
We can also do weighted traffic splitting within a priority tier, which turned out to be useful for gradual provider migrations and cost optimization.
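Weighted splitting is the standard roulette-wheel selection; a minimal sketch (with the random source injected, which is also what makes it testable):

```typescript
interface Weighted<T> {
  target: T;
  weight: number;
}

// Pick one target from a tier, proportionally to its weight.
// `rand` returns a number in [0, 1); injected rather than calling
// Math.random() directly so routing decisions can be unit-tested.
function pickWeighted<T>(tier: Weighted<T>[], rand: () => number): T {
  const total = tier.reduce((sum, t) => sum + t.weight, 0);
  let roll = rand() * total;
  for (const t of tier) {
    roll -= t.weight;
    if (roll < 0) return t.target;
  }
  return tier[tier.length - 1].target; // guard for float edge cases
}
```

A 90/10 split during a migration is then just two entries in one tier, and shifting traffic means changing a weight, not code.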
The management layer is a Rails 8 app with PostgreSQL, Hotwire for the frontend, Tailwind v4 with daisyUI.
There's also a cool arena feature where we run blind comparisons between models to actually measure which one is better at what. Turns out opinions about model quality shift a lot once you do side-by-side tests at scale. Check out our leaderboard and make sure to cast a few votes in the arena while you're there.
What I keep coming back to
Looking back, the pattern we hit isn't unique to image generation. It shows up everywhere in AI. People talk about models, which makes sense: models are the exciting part. But the moment you try to build something real on top of them, you run into all the stuff that isn't the model. The routing. The normalization. The failure handling. The cost tracking. The boring infrastructure work that turns powerful models into something you can actually depend on.
We didn't start with a thesis about routing image models. We started trying to build a product. Then we kept tripping over the same infrastructure gaps. Eventually it became clear that the missing piece wasn't another app-level feature — it was the layer underneath.
That's what Lumenfall is. The thing we kept wishing existed while we were building something else.
Our north star now is a broader developer platform for AI media generation, including observability, with fine-tuned vision models as judges to monitor production traffic. If you have any requests, feel free to contact me or comment on the article. We're in full builder mode and open to any idea.