DEV Community: Fox

The hard part of national ID OCR isn't the OCR

Fox — Fri, 19 Jun 2026 03:32:18 +0000

You wire up OCR for your KYC flow, point it at a national ID card, and get back a clean { name, idNumber, dateOfBirth }. Ship it. Then you onboard your second country — and it falls apart. Fields you mapped don't exist. The name comes back as garbled Latin. The date of birth says the year 2567.

Here's the thing nobody tells you when you start: the hard part of national ID OCR isn't the OCR. It's that every country's ID is a different document. A model that reads text off a card is table stakes. Turning 30 countries' cards into data your system can actually use is where the work is.

Let me show you the three axes of variation that will bite you, then how to architect so they don't.

Axis 1: the fields are different

There is no universal "national ID" schema, because the cards themselves don't agree on what to print.

A Thai ID card prints the holder's religion.
A German ID card prints height and eye color.
A Chinese ID card prints ethnicity and the issuing authority.

None of these are edge cases — they're core fields on those documents. So the instinct to define one IdCard type with a fixed set of columns is wrong from day one. Either you drop information that some countries consider essential, or you end up with a sparse table full of nulls and country-specific special-casing.

And it's not just which fields exist — it's what they're called and how they're split. The same "name" concept might come back as a single full-name string on one card and as separate given/family fields on another, sometimes in two scripts at once. Your data model has to treat "the field set depends on the country" as a first-class fact, not an afterthought.
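One way to make "the field set depends on the country" concrete is an envelope with a small shared core plus a per-country bag of fields. This is a minimal sketch, not a prescribed schema: the `IdDocument` class, its attribute names, and the `extras` dict are all hypothetical, and the sample values are placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class IdDocument:
    """Hypothetical envelope: shared core plus whatever the card actually prints."""
    document_type: str                     # illustrative label, e.g. "th_id_card"
    id_number: Optional[str] = None        # number as printed, not yet validated
    date_of_birth: Optional[str] = None    # ISO 8601 once normalized
    extras: Dict[str, Any] = field(default_factory=dict)  # country-specific fields

    def get(self, name: str, default: Any = None) -> Any:
        """Look up a country-specific field without assuming it exists."""
        return self.extras.get(name, default)


thai = IdDocument("th_id_card", "1234567890123", "1990-01-01", {"religion": "พุทธ"})
german = IdDocument("de_id_card", "L01X00T28", "1990-01-01",
                    {"height": "170", "eyeColor": "BLAU"})

print(thai.get("religion"))
print(german.get("religion"))  # None: that card has no such field
```

The point of the design is that a missing field is an expected outcome of the lookup, not a null column you special-case per country.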

Axis 2: the script is different

If your users are global, a lot of their names are not in the Latin alphabet — Chinese, Thai, Arabic, and more.

The naive move is to transliterate everything to Latin "so it's consistent." Don't. Transliteration is lossy and ambiguous: multiple native spellings collapse to the same Latin form, diacritics get dropped, and you can no longer match the name back against the source document or a government database. For KYC specifically, mangling the name defeats the purpose.

The correct approach is to keep the native-script value as printed, and carry a Latin form alongside it when the card itself prints one (many do — they show both). That way you have a local handle for matching against local records and a Latin handle for systems that need ASCII, without throwing away the original.
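A sketch of that "native first, Latin alongside" rule, assuming a hypothetical `PersonName` record (the class and method names are mine, not from any library). The only normalization applied to the native value is Unicode NFC, which changes how characters are encoded but not which characters they are.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonName:
    """Hypothetical name record: the native-script value is the source of truth."""
    native: str                    # exactly as printed on the card
    latin: Optional[str] = None    # set only when the card itself prints a Latin form

    def match_key(self) -> str:
        """Key for comparing against local records: NFC-normalized, trimmed."""
        return unicodedata.normalize("NFC", self.native).strip()

    def ascii_handle(self) -> Optional[str]:
        """Latin form for ASCII-only systems; None instead of a guessed transliteration."""
        return self.latin


printed_both = PersonName(native="สมชาย ใจดี", latin="SOMCHAI JAIDEE")
native_only = PersonName(native="สมชาย ใจดี")

print(printed_both.ascii_handle())
print(native_only.ascii_handle())  # None: nothing to report, so nothing is invented
```

Returning `None` when the card has no Latin line keeps the decision to transliterate (and the ambiguity that comes with it) visible to the caller.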

Axis 3: the format is different — dates and ID numbers

Two formats will surprise you specifically:

Dates aren't all Gregorian. A Thai card prints dates in the Buddhist calendar (BE), which runs 543 years ahead of the Gregorian year and, on top of that, uses Thai numerals rather than Arabic digits. So two things break a naive parser. First, an ASCII-only digit parser rejects the numerals outright (JavaScript's parseInt returns NaN, for example, though Python's int() does accept Unicode decimal digits). Second, even once you read the number, treating it as a Gregorian year leaves every age check and expiry comparison off by 543 years. The fix is to convert the numerals, subtract 543, and normalize to a single representation (ISO 8601 is the sane choice) while keeping the original string around for display. (You'll see exactly this, the raw Thai date next to the normalized one, in the response further down.)
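Here is a minimal sketch of that conversion for the "day, full month name, BE year" layout. The function name `thai_be_to_iso` and the lookup tables are my own; the sketch assumes full month names and whitespace-separated parts, so abbreviated months or other layouts would need an extra table.

```python
from datetime import date

# Full Thai month names mapped to month numbers.
THAI_MONTHS = {
    "มกราคม": 1, "กุมภาพันธ์": 2, "มีนาคม": 3, "เมษายน": 4,
    "พฤษภาคม": 5, "มิถุนายน": 6, "กรกฎาคม": 7, "สิงหาคม": 8,
    "กันยายน": 9, "ตุลาคม": 10, "พฤศจิกายน": 11, "ธันวาคม": 12,
}

# Map Thai digits to ASCII explicitly rather than relying on parser behavior.
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

BE_OFFSET = 543  # Buddhist Era year = Gregorian year + 543


def thai_be_to_iso(raw: str) -> str:
    """Convert e.g. '๑ มกราคม ๒๕๓๓' to an ISO 8601 date string."""
    day_s, month_s, year_s = raw.translate(THAI_DIGITS).split()
    gregorian_year = int(year_s) - BE_OFFSET
    return date(gregorian_year, THAI_MONTHS[month_s], int(day_s)).isoformat()


raw_birth = "๑ มกราคม ๒๕๓๓"
record = {"birthRaw": raw_birth, "birthIso": thai_be_to_iso(raw_birth)}
print(record["birthIso"])  # 1990-01-01
```

Storing the raw string next to the ISO value is what makes the normalization reversible: you can always show the user what the card actually says.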

ID numbers have structure. A Thai national ID is 13 digits; a Chinese one is 18 characters: 17 digits plus a check character that may be X. Many encode a checksum digit, region codes, or a date. That structure is genuinely useful, since you can validate the checksum to catch a misread early, but note the responsibility split: an OCR step gives you the number as printed; validating it against each country's rule is logic you own, per country. Don't assume the extraction layer does it for you.
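To show what "logic you own, per country" can look like, here is a sketch of the two checksum rules: the Thai mod-11 check digit and the Chinese ISO 7064 MOD 11-2 check character. The function names and the `VALIDATORS` registry are hypothetical. These checks only catch misreads; they say nothing about whether a number was actually issued.

```python
def thai_id_is_valid(number: str) -> bool:
    """Thai national ID: 13 digits, the last one a mod-11 check digit."""
    if len(number) != 13 or not (number.isascii() and number.isdigit()):
        return False
    # Weights run 13 down to 2 across the first 12 digits.
    total = sum(int(d) * w for d, w in zip(number[:12], range(13, 1, -1)))
    return (11 - total % 11) % 10 == int(number[12])


CN_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CN_CHECK_CHARS = "10X98765432"  # indexed by weighted sum mod 11


def chinese_id_is_valid(number: str) -> bool:
    """Chinese resident ID: 17 digits plus one check character (digit or X)."""
    if len(number) != 18:
        return False
    body, check = number[:17], number[17].upper()
    if not (body.isascii() and body.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(body, CN_WEIGHTS))
    return CN_CHECK_CHARS[total % 11] == check


# Hypothetical registry keyed by document type, one rule per country.
VALIDATORS = {"th_id_card": thai_id_is_valid, "cn_id_card": chinese_id_is_valid}

print(VALIDATORS["th_id_card"]("1234567890121"))  # True
print(VALIDATORS["th_id_card"]("1234567890123"))  # False
```

Note that a sequential placeholder like 1234567890123 fails the Thai check (the correct final digit for that prefix is 1), which is exactly the kind of single-digit misread the checksum is there to catch.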

Two ways to deal with this

Once you accept that ID OCR is per-country, you have two implementation paths:

Build the per-country layer yourself. Run a generic OCR/vision model, then write, per country, the field mapping, the script handling, the date-calendar conversion, and the number parsing — plus tests with real sample cards for each. This is doable, but the cost is linear in the number of countries you support, and it never really ends: layouts get redesigned, new document versions ship, a new market means a new adapter.
Use an API that already encodes the per-country knowledge. You hand it an image and tell it which country/document you're reading; it gives back the fields that document actually has, in the right scripts, with dates normalized. You've outsourced the heterogeneity instead of maintaining it.
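For a sense of what the first path involves, here is a sketch of a per-country adapter registry sitting behind a generic OCR step. Everything in it is illustrative: the `adapter` decorator, the raw keys such as `id` and `doc_no`, and the output keys are assumptions, since the real raw fields depend on whichever OCR engine you run.

```python
from typing import Callable, Dict

# A normalizer turns raw OCR key/value pairs into cleaned, per-country fields.
Normalizer = Callable[[Dict[str, str]], Dict[str, str]]

ADAPTERS: Dict[str, Normalizer] = {}


def adapter(document_type: str) -> Callable[[Normalizer], Normalizer]:
    """Decorator that registers one normalizer per document type."""
    def register(fn: Normalizer) -> Normalizer:
        ADAPTERS[document_type] = fn
        return fn
    return register


@adapter("th_id_card")
def normalize_thai(raw: Dict[str, str]) -> Dict[str, str]:
    # Hypothetical raw keys; the printed number is grouped with spaces.
    return {"idNumber": raw["id"].replace(" ", ""), "religion": raw.get("religion", "")}


@adapter("de_id_card")
def normalize_german(raw: Dict[str, str]) -> Dict[str, str]:
    return {"documentNumber": raw["doc_no"].upper(), "height": raw.get("height", "")}


def normalize(document_type: str, raw: Dict[str, str]) -> Dict[str, str]:
    """Dispatch to the adapter for this document type, or fail explicitly."""
    fn = ADAPTERS.get(document_type)
    if fn is None:
        raise ValueError(f"no adapter registered for {document_type!r}")
    return fn(raw)


print(normalize("th_id_card", {"id": "1 2345 67890 12 1"}))
```

Each new market is one more decorated function plus its tests, which is where the linear cost shows up.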

If you're supporting more than two or three countries, the second path is usually the honest economic call. PicToText's ID Card OCR API is one option built around exactly this per-country model, so the rest of this post shows what that looks like in practice.

What it looks like in practice

You send one image and a country-specific documentType — the documentType is how you select which national format to read. With cURL:

curl -X POST "https://pictotext.io/api/v1/ocr" \
  -H "Authorization: Bearer sk_live_YOUR_API_KEY" \
  -F "image=@id_card.jpg" \
  -F "documentType=th_id_card"

Or the same call in Python:

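import requests

# Send the card image as multipart form data; documentType selects the national format.
with open("id_card.jpg", "rb") as image:
    resp = requests.post(
        "https://pictotext.io/api/v1/ocr",
        headers={"Authorization": "Bearer sk_live_YOUR_API_KEY"},
        files={"image": image},
        data={"documentType": "th_id_card"},  # e.g. th_id_card, de_id_card, cn_id_card
        timeout=30,
    )
resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
data = resp.json()
# These keys are specific to the Thai card; other document types return different ones.
print(data["nameEn"], data["nameTh"], data["birthEn"])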

Now compare two responses to see what makes this per-country. A Thai card comes back with native and Latin names, a normalized date next to the original, and religion:

{
  "idNumber": "1234567890123",
  "nameTh": "สมชาย ใจดี",
  "nameEn": "SOMCHAI JAIDEE",
  "birthTh": "๑ มกราคม ๒๕๓๓",
  "birthEn": "1990-01-01",
  "expiryDateEN": "2030-01-01",
  "religion": "พุทธ",
  "address": "123/456 หมู่ 1 ถ.สุขุมวิท เขตคลองเตย กรุงเทพฯ 10110"
}

A German card, same endpoint, returns a different field set entirely — height, eyeColor, placeOfBirth:

{
  "firstName": "ERIKA",
  "lastName": "MÜLLER",
  "documentNumber": "L01X00T28",
  "dateOfBirth": "1990-01-01",
  "nationality": "DEUTSCH",
  "placeOfBirth": "BERLIN",
  "eyeColor": "BLAU",
  "height": "170",
  "address": "PLATZ DER REPUBLIK 1, 11011 BERLIN"
}

Notice what axes 1–3 look like when they're handled: native script preserved (nameTh), Latin alongside (nameEn), the Buddhist-calendar birthdate normalized (birthEn), and each country's own fields returned instead of being flattened away. The exact keys per country are in the field reference — e.g. the Thai ID card page.
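Because the key names differ per document type, downstream code still needs a small lookup before it can run something like an age check. A sketch using only the two sample responses above: the `BIRTH_DATE_KEY` table and both helper functions are hypothetical, and any other document type would need its key added from the field reference.

```python
from datetime import date

# Which response key holds the ISO birth date, per document type.
# Taken from the two sample responses; other document types are not covered.
BIRTH_DATE_KEY = {"th_id_card": "birthEn", "de_id_card": "dateOfBirth"}


def birth_date(document_type: str, payload: dict) -> date:
    """Read the normalized birth date out of a per-country payload."""
    return date.fromisoformat(payload[BIRTH_DATE_KEY[document_type]])


def age_on(born: date, today: date) -> int:
    """Full years elapsed between born and today."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


thai_payload = {"birthEn": "1990-01-01", "birthTh": "๑ มกราคม ๒๕๓๓"}
german_payload = {"dateOfBirth": "1990-01-01"}

for doc_type, payload in (("th_id_card", thai_payload), ("de_id_card", german_payload)):
    born = birth_date(doc_type, payload)
    print(doc_type, age_on(born, date(2026, 6, 19)))  # 36 for both samples
```

With the date already normalized to ISO 8601, the age logic is identical for both cards; only the key lookup is per-country.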

Takeaway

If you remember one thing: treat "every country is different" as a first-class design constraint, not something you patch after OCR. Your data model should expect a per-country field set, your name handling should preserve native script, and your dates should be normalized but reversible. Whether you build that layer yourself or use an API that already has it, the teams that get global identity verification right are the ones that designed for the heterogeneity up front.

If you'd rather not maintain a per-country adapter for every market, the docs are open on GitHub and there's a quickstart to try it on your own sample cards.

The problem: too many image models, which one do I use?

Fox — Mon, 01 Jun 2026 09:19:46 +0000

New image-generation models keep landing (Nano Banana, Nano Banana Pro, GPT Image 2, ByteDance's Seedream), and each claims to be the best. But when you actually need one good image, the real sticking points are:

Same request, different model — how much does the output actually differ?
Re-tuning the prompt for every model is exhausting.
Comparing them means hopping between platforms and signing up over and over.

So I ran a small, reproducible test: fix one prompt, feed it to several models, look at the differences, and boil it down to a selection cheat sheet. To avoid the multi-platform shuffle I did the comparison on cvy.ai, where you switch models from a dropdown on the same prompt — no re-registering, no rewriting.

Step 1: make the prompt reproducible with a template

A fair comparison needs a stable, reusable prompt — otherwise the differences you see are just you writing it differently each time. I break prompts into fixed slots:

[subject] + [style/medium] + [composition/lens] + [light/mood] + [details] + [aspect ratio]

A portrait example:

Subject: a young woman in a casual wool coat, half body, glancing toward camera
Style: cinematic realism, warm film tone
Composition: 85mm telephoto, shallow depth of field
Light: golden-hour, a glowing storefront neon sign reading "CAFE" in the blurred background, rim light on hair
Details: natural skin, windswept hair strands, no over-smoothing
Aspect ratio: 3:4 vertical

Note the neon sign reading "CAFE" — it's deliberate. Text inside an image is one of the clearest ways models differ, so keeping a short word in the scene makes the text-rendering comparison below much more telling.

Flatten that into one continuous prompt and every model gets identical input — that's what makes the comparison mean something.
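If you'd rather not flatten by hand, the slot template maps naturally onto a small data class. A minimal sketch: the `PromptTemplate` class, the comma separator, and passing the aspect ratio as prompt text are my assumptions (some tools take the aspect ratio as a separate setting).

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class PromptTemplate:
    """Hypothetical container for the six slots, in template order."""
    subject: str
    style: str
    composition: str
    light: str
    details: str
    aspect_ratio: str

    def flatten(self) -> str:
        """Join the slots in declaration order into one continuous prompt."""
        return ", ".join(getattr(self, f.name).strip() for f in fields(self))

    def with_subject(self, subject: str) -> "PromptTemplate":
        """Reuse the template with a different subject, leaving other slots untouched."""
        return replace(self, subject=subject)


portrait = PromptTemplate(
    subject="a young woman in a casual wool coat, half body, glancing toward camera",
    style="cinematic realism, warm film tone",
    composition="85mm telephoto, shallow depth of field",
    light='golden-hour, a glowing storefront neon sign reading "CAFE" '
          "in the blurred background, rim light on hair",
    details="natural skin, windswept hair strands, no over-smoothing",
    aspect_ratio="3:4 vertical",
)

prompt = portrait.flatten()  # send this exact string to every model
print(prompt)
```

Because the template is frozen and flattening is deterministic, every model sees byte-identical input, and swapping the subject can't accidentally change the lighting or lens slots.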

💡 Tip: instead of staring at an empty prompt box, keep a few reusable templates (portrait / product / scene / social cover) and just swap the subject. cvy.ai ships a set of editable templates I use as a starting point — faster than writing from scratch.

Step 2: same prompt, four models side by side

GPT Image 2 — the winner.

Prompt Adherence: Excellent. The "windswept hair" is dynamic and natural. The pose of the subject glancing over her shoulder adds superb narrative depth.
Text Rendering: The spelling of "CAFE" is perfect. The neon glow and bokeh effect integrate flawlessly with the optical physics of an 85mm lens.
Lighting & Vibe: Perfectly captures the "golden-hour" backlight. The rim light on the hair is spot-on, and the warm film tone is rich and cinematic.
Details & Textures: The skin retains authentic texture without feeling over-smoothed. The wool coat texture is slightly soft but generally solid.
Review: The absolute winner of this test. It completely nails the "cinematic realism" requirement with a perfect balance of atmosphere and accuracy.

Seedream 4.5 — biggest visual impact.

Prompt Adherence: Follows the composition well, though the "windswept hair" feels a bit forced and slightly clumpy rather than naturally blown by the wind.
Text Rendering: "CAFE" is perfectly legible, featuring a very strong and bright neon glow.
Lighting & Vibe: Takes a highly aggressive approach to the "golden-hour" and "rim light" prompts with intense backlighting. This creates massive visual impact, though it sacrifices a bit of the soft film vibe requested.
Details & Textures: The wool coat texture is well-rendered. However, while freckles are present, there is still a faint hint of "AI smoothing" on the skin, making it feel slightly less than 100% natural.
Review: The strongest visual impact. While slightly over-rendered, it is incredibly polished and ready for direct commercial use.

Nano Banana Pro — texture king, lopsided.

Details & Textures: The undisputed king of textures. The coarse, pillowy grain of the wool coat, the natural facial imperfections, and the ultra-realistic skin pores demonstrate terrifying microscopic rendering capabilities.
Text Rendering: "CAFE" is clear, but the neon tube structure looks somewhat stiff and doesn't blend seamlessly into the background lighting.
Lighting & Vibe (Major Deduction): It completely missed the "golden-hour" and "warm film tone" instructions. The lighting is incredibly flat (resembling an overcast afternoon), and the requested rim light on the hair is almost non-existent.
Prompt Adherence: The hair is messy, but it lacks the dynamic motion implied by "windswept."
Review: A hyper-realistic but lopsided specialist. It is visually flawless if you only care about micro-textures, but it completely failed to follow the core lighting and atmospheric instructions.

Nano Banana — baseline.

Prompt Adherence (Major Deduction): The hair is perfectly neat and tucked away, entirely ignoring the "windswept hair" prompt.
Text Rendering: Suffers from AI hallucination; an extra glowing accent mark appeared above the 'E', spelling "CAFÈ" instead of "CAFE".
Lighting & Vibe: The lighting is dull with only a very faint hint of a sunset glow. It lacks cinematic tension, and the background blur feels rigid and artificial.
Details & Textures: The coat feels more like flat felt than coarse wool, and the skin details are the flattest among the four candidates.
Review: Baseline performance. It missed multiple instructions and falls significantly behind the other models in this test.

Step 3: the cheat sheet

Same prompt across the board, distilled (⭐ = relative strength from this test):

Model	Prompt Adherence	Text Rendering	Lighting & Vibe	Details & Textures	Best for
Nano Banana	⭐⭐	⭐⭐	⭐⭐	⭐⭐⭐	Quick flat drafts
Nano Banana Pro	⭐⭐⭐	⭐⭐⭐	⭐⭐	⭐⭐⭐⭐⭐	Hyper-realistic textures
GPT Image 2	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	Cinematic storytelling
Seedream 4.5	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	High-impact commercial

The takeaway: no single model wins on every axis. The skill is matching the model to the job — cinematic storytelling from one, raw texture from another, high-impact commercial polish from a third. That's exactly why I didn't want to juggle separate platforms: compare once, and you know which job goes where.
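The cheat sheet is small enough to keep as data and query by task. A sketch with the star counts from the table above; the axis names, the `pick_model` helper, and the weighting scheme are my own, and the scores reflect one prompt in one test, so treat the output as a starting point.

```python
from typing import Dict

# Star counts copied from the cheat sheet (single prompt, single test).
SCORES: Dict[str, Dict[str, int]] = {
    "Nano Banana":     {"adherence": 2, "text": 2, "lighting": 2, "texture": 3},
    "Nano Banana Pro": {"adherence": 3, "text": 3, "lighting": 2, "texture": 5},
    "GPT Image 2":     {"adherence": 5, "text": 5, "lighting": 5, "texture": 4},
    "Seedream 4.5":    {"adherence": 4, "text": 5, "lighting": 4, "texture": 4},
}


def pick_model(weights: Dict[str, float]) -> str:
    """Return the model with the highest weighted score for this job.

    Ties go to whichever model appears first in SCORES.
    """
    def weighted(model: str) -> float:
        return sum(SCORES[model][axis] * w for axis, w in weights.items())
    return max(SCORES, key=weighted)


print(pick_model({"texture": 1.0}))                    # Nano Banana Pro
print(pick_model({"lighting": 1.0, "adherence": 1.0})) # GPT Image 2
```

Weighting only the axes a job cares about is the code version of "match the model to the job": a texture-only brief and a lighting-heavy brief land on different models.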

A reusable generation workflow

What the experiment settled into as my default:

Start from a template — pick the closest prompt template, swap the subject.
Pick the model by task — use the cheat sheet; don't default to the same one every time.
Add a reference image when you need direction — when text alone won't land it, upload a reference so the result follows an existing look.
Iterate in small steps — keep prompts, styles, model choices, and good samples together so each render informs the next.
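For the last step, "keeping things together" can be as light as one JSON line per render. A sketch under my own assumptions: the `RenderRecord` fields, the JSON Lines file, and both helpers are hypothetical and not tied to any platform.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class RenderRecord:
    """One render attempt: what was asked, of which model, and whether it was good."""
    prompt: str
    model: str
    reference_image: str = ""  # path to a reference image, empty for text-only runs
    output_path: str = ""
    keep: bool = False         # mark the samples worth building on
    notes: str = ""


def append_record(log: Path, record: RenderRecord) -> None:
    """Append one record as a single JSON line."""
    with log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def load_keepers(log: Path) -> List[RenderRecord]:
    """Read the log back and return only the renders marked as keepers."""
    with log.open(encoding="utf-8") as fh:
        records = [RenderRecord(**json.loads(line)) for line in fh if line.strip()]
    return [r for r in records if r.keep]
```

A typical use is to call `append_record` after every render and `load_keepers` before the next session, so the prompt and model behind each good sample are never more than one lookup away.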

I run the whole flow on cvy.ai — templates, multiple models, and text-to-image / image-to-image in one workspace — which suits a "compare fast, iterate often" habit. The method itself is platform-agnostic, though; any multi-model tool works.

Wrap-up

Don't pick a model by vibes — run one identical prompt across all of them.
Keep prompts structured and templated so comparison is fair and reuse is cheap.
Match the model to the task; don't worship one.

If "which image model should I use?" keeps tripping you up, spend ten minutes running your own same-prompt comparison — the conclusion beats any review. The portrait template and cheat sheet above are yours to copy.