AI image models still can't spell. Stop asking them to.

#ai #devtools #tutorial #programming

My video pipeline needed one image this week: a dark code editor showing a short config file — a title, three headings, four bullet lines. Simple, right?

The automated art fetch returned a photo of Saturn. An actual JWST photo of the planet, complete with somebody else's caption baked into the corner. My QA gate caught it one frame before publish.

Here's the part that matters: the fix was not "use a better image model with a better prompt." I had a FLUX endpoint sitting right there. I didn't use it, and if your pipeline puts words inside AI-generated images, you shouldn't either.

Text inside generated images is a dice roll

This isn't vibes; it's one of the most-documented weaknesses in image generation.

Rendering legible, correctly spelled text is a long-standing challenge for diffusion models. There's a whole research lineage (TextDiffuser, Glyph-ByT5, GlyphControl) devoted just to making models spell, and a 2025 stress-test benchmark (STRICT) reports that spelling accuracy "remains unsatisfactory even in state-of-the-art models."
FLUX.1 — genuinely one of the better open models at typography — lands around 60% first-attempt accuracy on short text. Ask for a magazine cover that says "FUTURE DESIGN" and some fraction of the time you get "FUTUR3 DESLGN."

60% accuracy is a fun demo. The roughly 40% defect rate that comes with it is completely unshippable for anything with your product's name on it. If a beat in my video shows a file called CLAUDE.md and the frame renders CLUADE.rnd, that frame doesn't ship, and my critic gate treats one garbled character as an automatic kill. So either the generation gets retried in a loop with a human squinting at every candidate... or you stop playing the game.
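To put a number on that retry loop, here is a back-of-envelope sketch. It assumes every text string in a frame is an independent draw at the 60% figure above, which is a simplification (real errors probably correlate with string length and font), and the function names are made up for illustration.

```javascript
// Back-of-envelope only: assumes each string renders correctly
// independently and with the same probability.
function chanceAllCorrect(perStringAccuracy, stringCount) {
  return Math.pow(perStringAccuracy, stringCount);
}

// Mean number of generations until one passes (geometric distribution).
function expectedAttempts(passProbability) {
  return 1 / passProbability;
}

const lines = 8; // a title, three headings, four bullet lines
const p = chanceAllCorrect(0.6, lines);
console.log(`clean frame: ${(p * 100).toFixed(1)}%`);
console.log(`expected attempts: ${Math.round(expectedAttempts(p))}`);
```

Under those assumptions the eight-line card comes out clean about 1.7% of the time, which works out to roughly 60 generations per usable frame.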

The fix: separate the layers

The rule I now run every media pipeline on:

The model paints pixels. Code paints letters.

Anything decorative — backgrounds, texture, scenes, mood — AI-generate freely. Anything that must be read — filenames, headings, code, UI labels, prices — gets rendered programmatically, where a font file guarantees every glyph. Then composite.

For my config-card frame I skipped the model entirely, because the whole image is text. One ffmpeg command draws it letter-perfect, deterministically, in about a second, for free:

// make-card.mjs — letter-perfect "code editor" card, no AI in the loop
import { spawnSync } from "child_process";
import fs from "fs"; import os from "os"; import path from "path";

const W = 1620, H = 2880;                        // 1.5x a 1080x1920 frame
const rows = [
  ["# CLAUDE.md",                  "0x818CF8"],
  ["", ""],
  ["## Commands",                  "0x818CF8"],
  ["- Build / run: npm run dev",   "0xE6EDF3"],
  ["- Test: npm test",             "0xE6EDF3"],
];

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "card-"));
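// Windows path to Consolas; swap in any monospace .ttf on your system.
// Copying it next to the text files lets the filter use a bare filename
// (a C:/ path inside the filter string would need its colon escaped).
fs.copyFileSync("C:/Windows/Fonts/consola.ttf", path.join(dir, "mono.ttf"));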

const filters = [
  // editor panel + title bar
  "drawbox=x=96:y=225:w=1428:h=1470:color=0x161B22:t=fill",
  "drawbox=x=96:y=225:w=1428:h=102:color=0x21262D:t=fill",
];
rows.forEach(([text, color], i) => {
  if (!text) return;
  const f = `t${i}.txt`;
  fs.writeFileSync(path.join(dir, f), text, "utf8");   // textfile dodges escaping hell
  filters.push(
    `drawtext=fontfile=mono.ttf:textfile=${f}:fontsize=60:fontcolor=${color}` +
    `:x=186:y=${393 + i * 93}`
  );
});

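const out = path.resolve("card.png");   // write next to where you run the script, not into the temp dir
const res = spawnSync("ffmpeg", ["-y", "-f", "lavfi", "-i", `color=c=0x0D1117:s=${W}x${H}`,
  "-vf", filters.join(","), "-frames:v", "1", out], { cwd: dir, encoding: "utf8" });
// spawnSync never throws on a failed render, so check it explicitly
if (res.error || res.status !== 0) throw new Error(res.error?.message ?? res.stderr);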

Three details that took me real defects to learn:

Use textfile=, not text=. Inline drawtext escaping (colons, quotes, percent signs) will eat an afternoon. A UTF-8 file per line just works — including em-dashes and ● characters.
Render at 1.5–2x your target frame. If the image gets a Ken Burns zoom or any rescale downstream, type rendered at exact size goes soft. Oversample and let the pipeline downscale.
One drawtext per color. drawtext is single-color. Group your lines by color (headings vs. body) instead of trying to be clever inside one filter.
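The oversampling detail is easy to get wrong by hand, so here is a minimal sketch of how it could be computed. The helper name and return shape are hypothetical; the scale filter and its flags=lanczos option are standard ffmpeg.

```javascript
// Hypothetical helper: pick a render size for a target frame and build the
// ffmpeg scale filter that brings it back down afterwards.
function oversamplePlan(targetW, targetH, factor = 1.5) {
  // Round to even numbers; yuv420p encoders such as libx264 need even dimensions.
  const even = (n) => 2 * Math.round(n / 2);
  return {
    renderW: even(targetW * factor),
    renderH: even(targetH * factor),
    // Lanczos is a common choice when downscaling crisp edges like type.
    downscale: `scale=${targetW}:${targetH}:flags=lanczos`,
  };
}

const plan = oversamplePlan(1080, 1920);
console.log(plan.renderW, plan.renderH); // 1620 2880
console.log(plan.downscale);             // scale=1080:1920:flags=lanczos
```

For a 1080x1920 frame at 1.5x this reproduces the 1620x2880 canvas used in the card script.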

Need richer layouts than ffmpeg can draw — flexbox, gradients, rounded corners? Same principle, nicer tools: Satori (Vercel's HTML/CSS-to-SVG library, the thing behind their OG-image service) gives you real layout with guaranteed glyphs, and node-canvas or Sharp will composite your text layer over an AI background. The compositing is the point: generated pixels under, deterministic type over.
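As a rough illustration of the compositing step, the sketch below builds the arguments for ffmpeg's overlay filter instead of using Sharp or node-canvas, purely to stay dependency-free. The file names are placeholders, and it assumes the text layer is a PNG with a transparent background at the same size as the AI background.

```javascript
// Builds ffmpeg arguments that lay a transparent text PNG over a background.
// File names are placeholders; both images are assumed to be the same size.
function compositeArgs(backgroundPng, textLayerPng, outPng) {
  return [
    "-y",
    "-i", backgroundPng,   // input 0: AI-generated pixels
    "-i", textLayerPng,    // input 1: programmatically rendered type (needs alpha)
    "-filter_complex", "[0:v][1:v]overlay=0:0",
    "-frames:v", "1",
    outPng,
  ];
}

const args = compositeArgs("background.png", "text.png", "frame.png");
console.log(["ffmpeg", ...args].join(" "));
```

The array can be handed to spawnSync("ffmpeg", args) the same way the card script does. The model never touches input 1, so a regenerated background cannot change a single letter.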

The general rule (this is about more than images)

The Saturn incident and the spelling research are the same lesson wearing two hats:

Never let a probabilistic component produce something a deterministic component can produce.

Text in images is the cleanest example — a font file has a 0% typo rate, forever, for free — but the pattern repeats all over agent pipelines:

Don't let the model do arithmetic in prose; make it call a calculator.
Don't let it "remember" your test command; pin it in a config file it reads every session.
Don't let it re-fetch "a nice background" at render time; pin the exact asset path so every re-render is reproducible.
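The asset-pinning item is the one that would have stopped the Saturn frame, so here is a minimal sketch of what pinning could look like. The manifest keys, paths, and function name are all hypothetical.

```javascript
// Hypothetical manifest: every beat maps to one exact, reviewed file.
const PINNED_ASSETS = Object.freeze({
  "config-card": "assets/config-card.png",
});

// Returns the pinned path or fails loudly. There is deliberately no
// "go fetch something similar" fallback.
function resolveAsset(beatId, manifest = PINNED_ASSETS) {
  if (!Object.hasOwn(manifest, beatId)) {
    throw new Error(`No pinned asset for beat "${beatId}"`);
  }
  return manifest[beatId];
}

console.log(resolveAsset("config-card")); // assets/config-card.png
```

A missing key stops the render instead of quietly substituting whatever a search returns, which turns a surprise at the QA gate into an error at build time.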

Save the model for what only the model can do. Everything else, write code — code that spells.

I build automated content pipelines with hard QA gates — every frame of every render gets read by a critic before anything ships. The Saturn frame is real and so is the kill rule that caught it. If you want more field notes like this, follow along.

Top comments (1)

Alex Shev • Jul 6

The spelling issue is a good reminder that image models are not layout engines. For production assets, text should stay in a controllable design layer unless the image model is being used only for rough ideation.