I built an AI photo merging tool after a 20-minute Photoshop failure

kevinbai — Mon, 08 Jun 2026 01:09:17 +0000

A few months ago I had a group photo from a trip where one friend was missing — she'd left early. I opened Photoshop, spent 20 minutes fighting selection masks, and gave up. The result looked terrible.

I figured: this is a problem with a clear solution. Someone should make it easy.

Validating before building

Before writing a line of code, I went looking for evidence that other people hit this wall too.

It didn't take long. Reddit threads, Quora questions, random Facebook groups — people asking how to add a family member who missed the reunion, how to combine two decent shots into one, how to put their kid next to a theme park character without actually going. Common, everyday problems that Photoshop technically solves but realistically doesn't — not for people without the time or skill to learn it.

That felt like enough signal to start.

What the product does

aiimagecombiner.app lets you upload 2–5 photos, add an optional text prompt to guide the composition, and get back a single blended image. No masking, no layers, no manual color matching.

The use cases that keep coming up:

Adding someone who wasn't in a group photo
Combining product photos without a studio
Before/after comparisons with matched tones
Putting a subject into a different background scene

The hardest part

Getting the blend to look natural. Naively stitching images together fails immediately — the lighting is always slightly off, proportions don't match, edges are obvious. The prompt-guided generation approach helped a lot here, letting the model reinterpret the whole scene rather than just paste things together.

The app is built on Next.js and Cloudflare Workers, and I moved fast with Claude Code.

What I'd tell myself before starting

The positioning language matters more than I expected. Users aren't searching for "AI image compositing" — they're searching for "how do I put two photos together." Getting that framing right took longer than it should have.

How AI Translates Manga: The Full Pipeline

kevinbai — Mon, 23 Mar 2026 14:36:10 +0000

Translating manga sounds simple: just read the text and translate it, right? In practice, it's a demanding problem that combines NLP and computer vision. The text is embedded in images, stylized, often arranged vertically, and packed inside speech bubbles that need to look natural after translation.

In this post I'll walk through the full AI pipeline behind automated manga translation — from raw image to a fully rendered, translated page.

The Full Pipeline

Input Image
    │
    ├─ [Optional] Upscaling
    ├─ 1. Text Detection
    ├─ 2. OCR
    ├─ 3. Textline Merge
    ├─ 4. Translation
    ├─ 5. Inpainting
    └─ 6. Rendering
         │
    Output Image

Six steps. Each one is a non-trivial problem on its own.
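
One way to read the diagram is as a chain of stages that each take a page and hand it to the next. Here is a minimal sketch, assuming a hypothetical `Page` container and placeholder stages that only record that they ran. None of these names come from a real library.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical container passed from stage to stage."""
    image: object                                  # raw pixels in a real pipeline
    log: List[str] = field(default_factory=list)   # names of stages that have run


Stage = Callable[[Page], Page]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its own name."""
    def stage(page: Page) -> Page:
        page.log.append(name)
        return page
    return stage


# The six numbered steps from the diagram, in order.
STAGES = [make_stage(n) for n in
          ("detect", "ocr", "merge", "translate", "inpaint", "render")]


def run_pipeline(page: Page, upscale: bool = False) -> Page:
    """Run the optional upscaling stage first, then the six fixed stages."""
    stages = ([make_stage("upscale")] if upscale else []) + STAGES
    for stage in stages:
        page = stage(page)
    return page
```

In a real system each placeholder would wrap a model call, but the ordering constraint stays the same: every stage consumes what the previous one produced.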

Pre-processing: Upscaling

Small panels or low-DPI scans contain text that's only a few pixels tall. Running detection and OCR on 8px-high characters produces garbage. Super-resolution models like Waifu2x — built specifically for anime/manga art — can 2x or 4x the resolution before anything else runs, dramatically improving downstream accuracy.
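
Whether to upscale at all can be decided from a rough estimate of glyph height. This is a small sketch, assuming a hypothetical target of 16px per character; that number is my illustration, not a measured threshold.

```python
def choose_scale(glyph_height_px: float, target_px: float = 16.0) -> int:
    """Return the smallest factor in (1, 2, 4) that brings glyphs up to target_px.

    target_px is an assumed value for illustration only.
    """
    for factor in (1, 2, 4):
        if glyph_height_px * factor >= target_px:
            return factor
    return 4  # cap at 4x even if the target is still not reached
```

With these assumptions, 8px-high text gets a 2x pass and text that is already 20px tall is left alone.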

Step 1: Text Detection

The first challenge is finding where text is. Manga text comes in every orientation, size, and style imaginable — axis-aligned bounding boxes aren't enough.

Modern text detectors predict a pixel-level probability map of "text-ness" across the entire image, then threshold and trace it into polygons. This produces arbitrarily shaped regions that conform to rotated, curved, or irregularly laid-out text, which a simple rectangle detector can't handle.
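
The threshold step can be sketched in plain Python. This assumes the detector's output is already available as a non-empty 2D list of probabilities, and it uses a flood fill with 4-connectivity to group pixels into regions. The function name and the 0.5 threshold are my assumptions, and a real pipeline would go on to trace each region's outline into a polygon (for example with a contour-tracing routine such as OpenCV's `cv2.findContours`).

```python
from collections import deque
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)


def text_regions(prob_map: List[List[float]],
                 threshold: float = 0.5) -> List[Set[Pixel]]:
    """Group above-threshold pixels into 4-connected regions of any shape."""
    rows, cols = len(prob_map), len(prob_map[0])
    seen: Set[Pixel] = set()
    regions: List[Set[Pixel]] = []
    for r in range(rows):
        for c in range(cols):
            if prob_map[r][c] < threshold or (r, c) in seen:
                continue
            # Breadth-first flood fill starting from this unvisited text pixel.
            region: Set[Pixel] = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and prob_map[ny][nx] >= threshold):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            regions.append(region)
    return regions


prob = [
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.9],
]
regions = text_regions(prob)  # one L-shaped region and one single-pixel region
```

The first region here is L-shaped, which is exactly the kind of shape a single rectangle would describe poorly.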

Step 2: OCR — Why Generic OCR Fails

Standard OCR tools fail badly on manga. Stylized fonts, vertical text, low contrast against screentone backgrounds — none of this matches what Tesseract or cloud OCR was trained on.

The solution is domain-specific models trained on manga panels. One example: MangaOCR, a ViT-based model fine-tuned specifically on Japanese manga. Because it's seen thousands of speech bubbles during training, it handles stylized lettering and vertical text far better than general-purpose OCR engines typically do.

Step 3: Textline Merge — A Graph Theory Problem

Detection gives you individual text lines. But a single speech bubble might contain five detected lines that need to be treated as one block for translation.

Merging by proximity alone fails: nearby lines from different bubbles get incorrectly grouped. The better approach is to model it as a graph problem. Each textline is a node; edges connect candidate pairs and are weighted by distance, font size similarity, and alignment direction. Cut the edges whose weights indicate a poor match, and the remaining connected components become coherent text blocks, ready for translation.
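
Here is a minimal sketch of that idea using union-find to collect connected components. The `TextLine` fields, the cost formula, its weights, and the `max_cost` cutoff are all assumptions I picked for illustration; a production system would tune or learn them.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot
from typing import Dict, List


@dataclass
class TextLine:
    x: float           # centre x in pixels
    y: float           # centre y in pixels
    font_size: float   # estimated glyph size in pixels
    vertical: bool     # True for top-to-bottom text


def edge_cost(a: TextLine, b: TextLine) -> float:
    """Lower is a better match. Weights 2.0 and 3.0 are illustrative."""
    dist = hypot(a.x - b.x, a.y - b.y) / min(a.font_size, b.font_size)
    size_gap = abs(a.font_size - b.font_size) / max(a.font_size, b.font_size)
    direction_penalty = 0.0 if a.vertical == b.vertical else 1.0
    return dist + 2.0 * size_gap + 3.0 * direction_penalty


def merge_lines(lines: List[TextLine], max_cost: float = 2.0) -> List[List[int]]:
    """Keep edges with cost <= max_cost and return the connected components."""
    parent = list(range(len(lines)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(lines)), 2):
        if edge_cost(lines[i], lines[j]) <= max_cost:
            parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(lines)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


lines = [
    TextLine(100, 100, 20, True),   # two columns of the same bubble
    TextLine(75, 100, 20, True),
    TextLine(130, 100, 40, True),   # close by, but a much larger font
    TextLine(400, 300, 20, True),   # a different bubble far away
]
blocks = merge_lines(lines)
```

In this example the third line sits only 30px from the first, yet the font size gap pushes its cost over the cutoff, so it stays in its own block.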

Step 4: Translation

With clean text extracted, translation is where you have the most flexibility. You can plug in any API (GPT-4, DeepL, Gemini) or run local models (Meta's NLLB, which covers 200+ languages, or lighter models specialized for Japanese→English).

One interesting pattern: chaining translators. Route text through one model to get an intermediate language, then through a second model specialized in the final target language. Sometimes two mediocre steps beat one expensive one.
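
Chaining is just function composition. In this sketch the two "translators" are toy dictionary lookups standing in for real model or API calls, and the names `chain` and `lookup` are hypothetical helpers.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def chain(*steps: Translator) -> Translator:
    """Compose translators so each one receives the previous one's output."""
    def translate(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return translate


def lookup(table: Dict[str, str]) -> Translator:
    """Toy stand-in for a model call: unknown text passes through unchanged."""
    return lambda text: table.get(text, text)


# Japanese -> English -> Spanish, with English as the intermediate language.
JA_TO_EN = {"ありがとう": "thank you"}
EN_TO_ES = {"thank you": "gracias"}

ja_to_es = chain(lookup(JA_TO_EN), lookup(EN_TO_ES))
```

Because each step only sees the previous step's output, any mistake in the intermediate language carries through to the final result, so this is worth comparing against a direct translation on your own data.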

Step 5: Inpainting — Erasing the Original Text

Before rendering the translation, the original text has to be erased and the background reconstructed. This is harder than it sounds — the model needs to hallucinate the screentone, hatching, or artwork that was hiding underneath.

LaMa (Large Mask Inpainting) handles this well. Its large receptive field lets it understand the global structure of the image before deciding what to fill in — so it correctly continues a crosshatch pattern or background gradient across the erased region, rather than just blending nearby pixels.

Step 6: Rendering — The Hardest Part Nobody Talks About

Most DIY translation tools fall flat here. Slapping text onto an image is trivial. Making it look like it belongs there is genuinely hard.

A few of the problems the renderer has to solve:

Length mismatch: a 10-character Japanese phrase might translate to 40 English characters. The renderer must dynamically shrink the font, reflow the text, and keep it inside the bubble.
Rotation: dialogue in action panels is often tilted. The text layer is warped to match using a homography transform — the same math behind AR markers and image stitching.
Vertical text: Japanese manga uses top-to-bottom columns. Simply rotating glyphs 90° doesn't produce correct vertical typography, because some characters (punctuation, brackets, the long vowel mark) have dedicated vertical variants that need to be substituted.
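
For the vertical text item, Unicode defines vertical presentation forms for several punctuation marks and brackets. The sketch below substitutes a handful of them and splits the text into columns; the function name and the fixed column height are assumptions. In practice a renderer may instead rely on the font's own vertical alternates, so treat this as an illustration of the substitution step only.

```python
from typing import List

# Horizontal code point -> Unicode vertical presentation form.
VERTICAL_FORMS = {
    0x3001: "\uFE11",  # ideographic comma
    0x3002: "\uFE12",  # ideographic full stop
    0x2026: "\uFE19",  # horizontal ellipsis
    0x300C: "\uFE41",  # left corner bracket
    0x300D: "\uFE42",  # right corner bracket
    0xFF08: "\uFE35",  # fullwidth left parenthesis
    0xFF09: "\uFE36",  # fullwidth right parenthesis
}


def to_vertical_columns(text: str, column_height: int) -> List[str]:
    """Swap in vertical forms, then cut the text into top-to-bottom columns."""
    converted = text.translate(VERTICAL_FORMS)
    return [converted[i:i + column_height]
            for i in range(0, len(converted), column_height)]


columns = to_vertical_columns("「はい」", 2)
```

Each returned string is one column read top to bottom; laying the columns out right to left is left to the renderer.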

Getting all three right simultaneously, for every panel on a page, is what separates a usable translation from a broken one.
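
The length mismatch case can be sketched as a search for the largest font size whose wrapped text still fits the bubble. This assumes a rectangular bubble and a crude width model where every character is `char_aspect` times the font size wide. A real renderer would measure glyphs with the actual font, and all parameter names and defaults here are my assumptions.

```python
import textwrap
from typing import List, Tuple


def fit_text(text: str, box_w: float, box_h: float,
             max_size: int = 32, min_size: int = 8,
             char_aspect: float = 0.6,
             line_spacing: float = 1.2) -> Tuple[int, List[str]]:
    """Return (font_size, lines) for the largest size that fits the box."""
    for size in range(max_size, min_size - 1, -1):
        chars_per_line = int(box_w // (size * char_aspect))
        if chars_per_line < 1:
            continue
        # Reflow without splitting words; a long word may overflow the width.
        lines = textwrap.wrap(text, width=chars_per_line,
                              break_long_words=False) or [""]
        widest = max(len(line) for line in lines)
        height = len(lines) * size * line_spacing
        if widest <= chars_per_line and height <= box_h:
            return size, lines
    raise ValueError("text does not fit even at the minimum font size")


size, lines = fit_text("I can't believe you actually did it!", 120, 80)
```

Searching from the largest size downward means a short line like "Hi!" keeps a big font, while a long translation shrinks only as far as it has to.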

Try It Without Setting Any of This Up

Configuring the full stack — models, CUDA, dependencies, API keys — takes real effort. If you just want to see what this pipeline produces on your own manga pages, mangatranslator.me runs it online without any local setup. Upload an image, pick a target language, done.

Final Thoughts

Automated manga translation is a good example of how real-world AI applications are rarely about one model — they're about a sequence of specialized models where each step's output quality determines the ceiling of the next. The rendering step in particular is a reminder that "put text on image" hides a surprising amount of engineering.

DEV Community: kevinbai