Why Every AI Image Generator Fails at Text (And One That Finally Doesn't)
If you've spent any time with AI image generators, you've probably run into the same frustrating pattern: you ask for a poster with some text on it, and you get back an image where the letters look like they were drawn by someone who has only heard what writing looks like.
FLUX.1 produces garbled glyphs. Stable Diffusion smears characters together. Midjourney treats your carefully written headline as decorative noise. And if you're working with Chinese, Japanese, or Korean — forget it. You'll get something that vaguely resembles the characters you wanted, surrounded by confident-looking nonsense.
This has been a known limitation for years. Most teams just work around it in post-processing.
Why Text Rendering Is Hard for Diffusion Models
The core issue is how diffusion models learn. They're trained on image-caption pairs, optimized to capture broad visual patterns — composition, style, color, form. Text inside images is treated as just another visual texture, not as structured symbolic information.
To render text correctly, a model needs to understand that A is not just a triangular shape: it's a specific symbol with specific strokes that must appear consistently regardless of font, color, or context. For Chinese, this is compounded by the sheer complexity of the character set (50,000+ characters vs. 26 letters).
Most models see enough English text during training to get close on short Latin strings. Chinese gets far less coverage.
Enter ERNIE-Image
Baidu recently open-sourced ERNIE-Image, an 8B parameter image generation model that was built from the ground up with text rendering as a first-class requirement.
The benchmarks are notable:
- LongTextBench: 0.9733 — the highest score on this benchmark for accurate text rendering in generated images
- GENEval: 0.8856 — strong general image quality
But benchmarks aside, the practical difference is immediately obvious when you try it. Ask it to generate a poster with a Chinese headline and English subtitle — you get clean, legible text. Ask for a product label with specific copy — you get the actual words you typed.
Here are some examples of what it can do:
Bilingual poster generation:
Prompt: An elegant tea ceremony poster with Chinese title "品茗时光" ("Tea Tasting Time") and English subtitle "Art of Tea", minimalist style, warm tones
The model correctly renders both scripts, keeps the strokes of the Chinese characters intact, and integrates the text naturally into the composition.
Product packaging:
Prompt: Luxury skincare product, clean white label, serif font, product name "LUMIÈRE" with French-style typography
Compare this to FLUX.1 on the same prompt — you'll see the difference immediately.
The Technical Architecture
What makes ERNIE-Image different architecturally? A few things:
Character-aware training: The model was trained with explicit supervision on character-level correctness, not just perceptual image quality.
Bilingual text handling: Native support for mixed Chinese-English prompts and outputs. You can specify text placement, font style, and language in the same prompt.
Structured layout understanding: Beyond just rendering individual characters, it understands layout concepts — columns, headlines, captions, callouts. This makes it genuinely useful for poster and infographic generation.
Apache 2.0 license: Fully open source, free for commercial use. No usage restrictions.
How to Use It
Option 1: Via fal.ai API
The model is hosted on fal.ai with a queue-based API:
# Submit a generation job
curl -X POST https://queue.fal.run/fal-ai/ernie-image/turbo \
  -H "Authorization: Key $FAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A product poster with bold headline TEXT HERE, minimalist design",
    "image_size": "landscape_4_3"
  }'

# Poll for result
curl "https://queue.fal.run/fal-ai/ernie-image/requests/{request_id}/status" \
  -H "Authorization: Key $FAL_KEY"
Get a key at fal.ai — they have a free tier.
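If you'd rather call the queue from Python, the two curl calls above can be wrapped in a small submit-and-poll helper. This is a sketch against the URLs shown in the curl examples; the response field names (`request_id`, `status`) and status values are assumptions based on fal.ai's queue conventions, so verify them against the actual payloads on your account.

```python
import os
import time

import requests

QUEUE_URL = "https://queue.fal.run/fal-ai/ernie-image/turbo"
STATUS_URL = "https://queue.fal.run/fal-ai/ernie-image/requests/{request_id}/status"


def build_headers(api_key: str) -> dict:
    # fal.ai uses a "Key <token>" authorization scheme, as in the curl example
    return {"Authorization": f"Key {api_key}", "Content-Type": "application/json"}


def submit_job(prompt: str, image_size: str = "landscape_4_3") -> str:
    """Submit a generation job; the 'request_id' response field is assumed."""
    resp = requests.post(
        QUEUE_URL,
        headers=build_headers(os.environ["FAL_KEY"]),
        json={"prompt": prompt, "image_size": image_size},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["request_id"]


def wait_for_result(request_id: str, poll_seconds: float = 2.0) -> dict:
    """Poll the status endpoint until the job leaves the queue."""
    url = STATUS_URL.format(request_id=request_id)
    headers = build_headers(os.environ["FAL_KEY"])
    while True:
        status = requests.get(url, headers=headers, timeout=30).json()
        # "IN_QUEUE"/"IN_PROGRESS" are assumed intermediate states
        if status.get("status") not in ("IN_QUEUE", "IN_PROGRESS"):
            return status
        time.sleep(poll_seconds)
```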
Option 2: No-code web app
If you just want to experiment without setting up API keys, ernie-image.com has a web interface with both Turbo (fast) and Standard (higher quality) modes. Free credits on sign-up, no credit card required.
Option 3: Self-host
The model weights are on Hugging Face. You'll need a reasonably sized GPU (the 8B model fits on a 24GB VRAM card with some quantization), but for production use the API route is probably easier.
Prompt Tips That Actually Work
After testing this extensively, a few things I've found make a big difference:
For text rendering:
Put the exact text you want in quotes within the prompt. The model seems to treat quoted strings as explicit text instructions.
A modern tech conference poster, title: "DEVCON 2025", date: "October 15-17", location: "San Francisco"
For Chinese text:
Be explicit about the script: "Chinese characters" or just write the Chinese directly in the prompt. The model handles both.
海报设计,标题"人工智能峰会",副标题"2025年技术前沿",现代简约风格
(Poster design, title "AI Summit", subtitle "2025 Technology Frontiers", modern minimalist style)
For mixed bilingual:
Specify both languages and their visual hierarchy:
Bilingual product label, Chinese main text "自然护肤" (large, top), English tagline "Pure Nature Skincare" (small, bottom), minimal design
For structured layouts:
Describe the layout explicitly — the model respects compositional instructions better than most alternatives:
4-panel comic strip layout, each panel with caption text at bottom, consistent character design
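The quoting tip above is easy to automate when you're generating prompts programmatically. Here's a small hypothetical helper (the function name is my own, not part of any ERNIE-Image SDK) that assembles a prompt from a style description plus labeled text fields, wrapping each piece of literal text in quotes so the model treats it as an explicit text instruction:

```python
def build_text_prompt(style: str, **text_fields: str) -> str:
    """Assemble a prompt with each literal text string wrapped in quotes.

    Keyword names become labels (title, date, ...), mirroring the
    'title: "DEVCON 2025", date: "October 15-17"' pattern shown above.
    """
    parts = [style]
    for label, text in text_fields.items():
        parts.append(f'{label}: "{text}"')
    return ", ".join(parts)


prompt = build_text_prompt(
    "A modern tech conference poster",
    title="DEVCON 2025",
    date="October 15-17",
    location="San Francisco",
)
# prompt == 'A modern tech conference poster, title: "DEVCON 2025",
#            date: "October 15-17", location: "San Francisco"'
```

Keyword arguments keep their insertion order in Python 3.7+, so the labels appear in the prompt in the order you pass them.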
When to Use It (and When Not To)
Good fit:
- Posters and marketing materials with text
- Bilingual content (Chinese/English)
- Product packaging mockups
- Infographics and diagrams with labels
- Comic/manga style with speech bubbles
Not the best choice:
- Pure photorealism without text (FLUX.1 Realism is stronger here)
- Complex scenes with many elements and no text (Midjourney still wins on aesthetics)
- Logos (use a vector tool)
Wrapping Up
Text rendering has been the dirty secret of AI image generation for years — impressive in demos, frustrating in practice. ERNIE-Image is the first open model I've used where I could actually write "put this specific text here" and have it work reliably.
For anyone building tools that generate marketing content, localized assets, or any kind of designed output with text — this is worth evaluating. The API is straightforward, the Apache 2.0 license removes the IP headaches, and the bilingual support opens up use cases that simply weren't viable before.
Have you run into the text-rendering problem in your own projects? Curious what workarounds others have been using — drop a comment.