The hard part of national ID OCR isn't the OCR


You wire up OCR for your KYC flow, point it at a national ID card, and get back a clean { name, idNumber, dateOfBirth }. Ship it. Then you onboard your second country — and it falls apart. Fields you mapped don't exist. The name comes back as garbled Latin. The date of birth says the year 2567.

Here's the thing nobody tells you when you start: the hard part of national ID OCR isn't the OCR. It's that every country's ID is a different document. A model that reads text off a card is table stakes. Turning 30 countries' cards into data your system can actually use is where the work is.

Let me show you the three axes of variation that will bite you, then how to architect so they don't.

Axis 1: the fields are different

There is no universal "national ID" schema, because the cards themselves don't agree on what to print.

A Thai ID card prints the holder's religion.
A German ID card prints height and eye color.
A Chinese ID card prints ethnicity and the issuing authority.

None of these are edge cases — they're core fields on those documents. So the instinct to define one IdCard type with a fixed set of columns is wrong from day one. Either you drop information that some countries consider essential, or you end up with a sparse table full of nulls and country-specific special-casing.

And it's not just which fields exist — it's what they're called and how they're split. The same "name" concept might come back as a single full-name string on one card and as separate given/family fields on another, sometimes in two scripts at once. Your data model has to treat "the field set depends on the country" as a first-class fact, not an afterthought.
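
One way to express that in code, as a minimal sketch: keep a small shared core and put everything a single country prints into a separate mapping. The `IdRecord` class and its field names are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class IdRecord:
    """Hypothetical record: a small shared core plus country-specific extras."""
    country: str        # ISO 3166-1 alpha-2 code, e.g. "TH"
    id_number: str      # kept as a string, exactly as printed
    extras: dict[str, str] = field(default_factory=dict)  # fields only this country prints


thai = IdRecord(country="TH", id_number="1234567890123",
                extras={"religion": "พุทธ"})
german = IdRecord(country="DE", id_number="L01X00T28",
                  extras={"height": "170", "eyeColor": "BLAU"})

# Country-specific fields are looked up explicitly, so a missing one is an
# expected case rather than a null column in a fixed schema.
print(thai.extras.get("religion"))
print(german.extras.get("religion"))  # None: not a field on this record
```

The design choice here is that the core stays small enough to be true for every country, and nothing country-specific is ever dropped to fit it.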

Axis 2: the script is different

If your users are global, a lot of their names are not in the Latin alphabet — Chinese, Thai, Arabic, and more.

The naive move is to transliterate everything to Latin "so it's consistent." Don't. Transliteration is lossy and ambiguous: multiple native spellings collapse to the same Latin form, diacritics get dropped, and you can no longer match the name back against the source document or a government database. For KYC specifically, mangling the name defeats the purpose.

The correct approach is to keep the native-script value as printed, and carry a Latin form alongside it when the card itself prints one (many do — they show both). That way you have a local handle for matching against local records and a Latin handle for systems that need ASCII, without throwing away the original.
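
A minimal sketch of that rule, assuming a hypothetical `PersonName` type: the native value is required, the Latin value is set only when the card prints one, and comparisons run on Unicode NFC-normalized text so that two encodings of the same characters still match.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonName:
    """Hypothetical name type: native script as printed, Latin only if printed."""
    native: str
    latin: Optional[str] = None

    def matches_native(self, other: str) -> bool:
        # Compare in NFC so composed and decomposed forms of the same text are equal.
        return (unicodedata.normalize("NFC", self.native)
                == unicodedata.normalize("NFC", other))


thai_name = PersonName(native="สมชาย ใจดี", latin="SOMCHAI JAIDEE")
print(thai_name.matches_native("สมชาย ใจดี"))

# "U" followed by a combining diaeresis (U+0308) is the decomposed spelling of "Ü".
german_name = PersonName(native="MÜLLER")
print(german_name.matches_native("MU\u0308LLER"))
```

Note that nothing here transliterates: if the card has no Latin form, `latin` stays `None` instead of being guessed.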

Axis 3: the format is different — dates and ID numbers

Two formats will surprise you specifically:

Dates aren't all Gregorian. A Thai card prints dates in the Buddhist calendar (BE), which runs 543 years ahead of the Gregorian year and, on top of that, uses Thai numerals and Thai month names rather than Arabic digits and English text. So two things break a naive parser: the string won't go through a standard date parser as-is (Python's int() happens to accept Thai digits, but JavaScript's parseInt() returns NaN on them, and neither knows what to do with the month name), and even once you read the number, treating it as a Gregorian year leaves every age check and expiry comparison off by 543 years. The fix is to convert the numerals, subtract 543, and normalize to a single representation (ISO 8601 is the sane choice), while keeping the original string around for display. (You'll see exactly this, the raw Thai date next to the normalized one, in the response further down.)
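
As a rough sketch of that conversion, assuming the date arrives as "day month year" with a full Thai month name (the function and constant names are made up for illustration, and abbreviated month names are not handled):

```python
from datetime import date

# Thai digits (U+0E50..U+0E59) mapped to ASCII 0-9.
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

# Full Thai month names only; this sketch ignores abbreviated forms.
THAI_MONTHS = {
    "มกราคม": 1, "กุมภาพันธ์": 2, "มีนาคม": 3, "เมษายน": 4,
    "พฤษภาคม": 5, "มิถุนายน": 6, "กรกฎาคม": 7, "สิงหาคม": 8,
    "กันยายน": 9, "ตุลาคม": 10, "พฤศจิกายน": 11, "ธันวาคม": 12,
}

BE_OFFSET = 543  # Buddhist Era year minus Gregorian year


def thai_date_to_iso(raw: str) -> str:
    """Convert e.g. '๑ มกราคม ๒๕๓๓' to an ISO 8601 date string."""
    day_s, month_s, year_s = raw.translate(THAI_DIGITS).split()
    gregorian_year = int(year_s) - BE_OFFSET
    # date() validates the result, so an impossible day raises ValueError.
    return date(gregorian_year, THAI_MONTHS[month_s], int(day_s)).isoformat()


raw = "๑ มกราคม ๒๕๓๓"
print(raw, "->", thai_date_to_iso(raw))  # normalized value is 1990-01-01
```

Store both values: the ISO string for comparisons, the raw string for display and audit.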

ID numbers have structure. A Thai national ID is 13 digits; a Chinese one is 18 characters (17 digits plus a check character that can be the letter X, so don't store it in a numeric column). Many encode a checksum digit, region codes, or a date. That structure is genuinely useful, because you can validate the checksum to catch a misread early. But note the responsibility split: an OCR step gives you the number as printed; validating it against each country's rule is logic you own, per country. Don't assume the extraction layer does it for you.
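
Here is what that per-country validation can look like, as a sketch of the two publicly documented check-digit rules (a weighted sum modulo 11 in both cases). The function names are illustrative, and a passing checksum only means the number is well-formed, not that it belongs to a real person.

```python
def thai_id_is_valid(number: str) -> bool:
    """Thai 13-digit ID: weights 13..2 over the first 12 digits, mod 11."""
    if len(number) != 13 or not (number.isascii() and number.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(number[:12], range(13, 1, -1)))
    return (11 - total % 11) % 10 == int(number[12])


CN_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CN_CHECK_CHARS = "10X98765432"  # indexed by (weighted sum mod 11)


def chinese_id_is_valid(number: str) -> bool:
    """Chinese 18-character ID: 17 digits plus a check character (0-9 or X)."""
    body, check = number[:17], number[17:].upper()
    if len(number) != 18 or not (body.isascii() and body.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(body, CN_WEIGHTS))
    return CN_CHECK_CHARS[total % 11] == check


# A placeholder like 1234567890123 fails the Thai checksum, which is the point:
# a single misread digit usually fails it too.
print(thai_id_is_valid("1234567890123"))
print(thai_id_is_valid("1234567890121"))
print(chinese_id_is_valid("11010519491231002X"))
```

The `isascii()` guard matters because `str.isdigit()` alone also accepts non-ASCII digits such as Thai numerals.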

Two ways to deal with this

Once you accept that ID OCR is per-country, you have two implementation paths:

Build the per-country layer yourself. Run a generic OCR/vision model, then write, per country, the field mapping, the script handling, the date-calendar conversion, and the number parsing — plus tests with real sample cards for each. This is doable, but the cost is linear in the number of countries you support, and it never really ends: layouts get redesigned, new document versions ship, a new market means a new adapter.
Use an API that already encodes the per-country knowledge. You hand it an image and tell it which country/document you're reading; it gives back the fields that document actually has, in the right scripts, with dates normalized. You've outsourced the heterogeneity instead of maintaining it.

If you're supporting more than two or three countries, the second path is usually the honest economic call. PicToText's ID Card OCR API is one option built around exactly this per-country model, so the rest of this post shows what that looks like in practice.

What it looks like in practice

You send one image and a country-specific documentType — the documentType is how you select which national format to read. With cURL:

curl -X POST "https://pictotext.io/api/v1/ocr" \
  -H "Authorization: Bearer sk_live_YOUR_API_KEY" \
  -F "image=@id_card.jpg" \
  -F "documentType=th_id_card"

Or the same call in Python:

import requests

with open("id_card.jpg", "rb") as image:
    resp = requests.post(
        "https://pictotext.io/api/v1/ocr",
        headers={"Authorization": "Bearer sk_live_YOUR_API_KEY"},
        files={"image": image},
        data={"documentType": "th_id_card"},  # e.g. th_id_card, de_id_card, cn_id_card
        timeout=30,
    )
resp.raise_for_status()
data = resp.json()
print(data["nameEn"], data["nameTh"], data["birthEn"])

Now compare two responses from the same endpoint to see what makes this per-country. A Thai card comes back with native + Latin names, a normalized date next to the original, and religion:

{
  "idNumber": "1234567890123",
  "nameTh": "สมชาย ใจดี",
  "nameEn": "SOMCHAI JAIDEE",
  "birthTh": "๑ มกราคม ๒๕๓๓",
  "birthEn": "1990-01-01",
  "expiryDateEN": "2030-01-01",
  "religion": "พุทธ",
  "address": "123/456 หมู่ 1 ถ.สุขุมวิท เขตคลองเตย กรุงเทพฯ 10110"
}

A German card, same endpoint, returns a different field set entirely — height, eyeColor, placeOfBirth:

{
  "firstName": "ERIKA",
  "lastName": "MÜLLER",
  "documentNumber": "L01X00T28",
  "dateOfBirth": "1990-01-01",
  "nationality": "DEUTSCH",
  "placeOfBirth": "BERLIN",
  "eyeColor": "BLAU",
  "height": "170",
  "address": "PLATZ DER REPUBLIK 1, 11011 BERLIN"
}

Notice what axes 1–3 look like when they're handled: native script preserved (nameTh), Latin alongside (nameEn), the Buddhist-calendar birthdate normalized (birthEn), and each country's own fields returned instead of being flattened away. The exact keys per country are in the field reference — e.g. the Thai ID card page.
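
On your side, you still need a thin mapping from each country's keys to whatever your system treats as common. A minimal sketch, using the keys from the two sample responses above; the adapter functions, the `ADAPTERS` registry, and the output key names are hypothetical application code, not part of the API:

```python
from typing import Callable


def from_thai(resp: dict) -> dict:
    return {
        "idNumber": resp["idNumber"],
        "dateOfBirth": resp["birthEn"],   # already normalized in the response
        "nameNative": resp["nameTh"],
        "nameLatin": resp["nameEn"],
    }


def from_german(resp: dict) -> dict:
    return {
        "idNumber": resp["documentNumber"],
        "dateOfBirth": resp["dateOfBirth"],
        "nameNative": f'{resp["firstName"]} {resp["lastName"]}',
        "nameLatin": None,                # the sample has no separate Latin form
    }


ADAPTERS: dict[str, Callable[[dict], dict]] = {
    "th_id_card": from_thai,
    "de_id_card": from_german,
}


def to_core(document_type: str, resp: dict) -> dict:
    """Map a per-country response to shared core fields, keeping the rest intact."""
    if document_type not in ADAPTERS:
        raise ValueError(f"no adapter for document type {document_type!r}")
    core = ADAPTERS[document_type](resp)
    core["documentType"] = document_type
    core["raw"] = resp  # country-specific fields (religion, height, ...) stay available
    return core


german = {"firstName": "ERIKA", "lastName": "MÜLLER",
          "documentNumber": "L01X00T28", "dateOfBirth": "1990-01-01",
          "eyeColor": "BLAU", "height": "170"}
record = to_core("de_id_card", german)
print(record["nameNative"], record["dateOfBirth"], record["raw"]["eyeColor"])
```

Failing loudly on an unknown document type is deliberate: a new market should force you to write an adapter, not silently fall through a default.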

Takeaway

If you remember one thing: treat "every country is different" as a first-class design constraint, not something you patch after OCR. Your data model should expect a per-country field set, your name handling should preserve native script, and your dates should be normalized but reversible. Whether you build that layer yourself or use an API that already has it, the teams that get global identity verification right are the ones that designed for the heterogeneity up front.

If you'd rather not maintain a per-country adapter for every market, the docs are open on GitHub and there's a quickstart to try it on your own sample cards.