Building a multilingual API serving 39,585 visa pairs in 15 languages

#apigateway #webdev #architecture #i18n

How we structured a visa requirements database across 15 languages with ISO standardization, caching layers, and translation pipelines.

Most travel APIs return data in English. That's fine if your users are American or British. It's useless if they're Japanese, Thai, Arabic-speaking, or any of the 5.5 billion people whose first language isn't English.

When a Vietnamese traveler looks up visa requirements for France, they need the answer in Vietnamese — not just the visa type, but the required documents list, the application process, the embassy information, the travel tips. Translating "Valid passport with 6 months minimum validity" into 15 languages isn't a string replacement. It's a content localization problem at scale.

The Orizn Visa API serves 39,585 passport-destination pairs in 15 languages: English, French, Spanish, Portuguese, German, Japanese, Korean, Chinese, Russian, Italian, Arabic, Hindi, Thai, Vietnamese, and Filipino. Here's how the system is structured.

Why these 15 languages

Not random. These 15 cover approximately 75% of the world's internet users by primary language:

English — lingua franca, baseline
Chinese — 1.1B speakers, China's visa-free program expanding rapidly
Spanish — 550M speakers, Latin America is a growing travel market
Hindi — 600M speakers, Indian passport holders are one of the most visa-restricted populations
Arabic — 370M speakers, Gulf states have some of the fastest-growing passports
Portuguese — 260M speakers, Brazil alone has 210M people
French — 280M speakers, major passport in Africa and Europe
Japanese, Korean — high-value travel markets, strong outbound tourism
Russian — 250M speakers, complex visa landscape post-2022
German, Italian — core EU passports
Thai, Vietnamese, Filipino — Southeast Asia, where digital nomads concentrate

Each language isn't just a translation — it's a market. A Thai translation means Thai travel bloggers can embed our widgets and Thai developers can build apps with localized visa data.

Data architecture

The core data model is simple:

visa_pair {
  passport:       ISO 3166-1 alpha-3  (e.g. "FRA")
  destination:    ISO 3166-1 alpha-3  (e.g. "JPN")
  requirement:    enum                (visa_free | visa_required | e_visa |
                                       visa_on_arrival | eta | no_admission)
  visa_free_days: integer | null
  verified:       boolean
  source_url:     string
  last_updated:   timestamp
}
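As a rough illustration, the same model could be expressed as a typed record. This is a minimal Python sketch, not the production schema: the class names are made up for the example, and the sample values (including the source URL and date) are placeholders rather than real visa data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Requirement(str, Enum):
    """The six standardized requirement types from the data model."""
    VISA_FREE = "visa_free"
    VISA_REQUIRED = "visa_required"
    E_VISA = "e_visa"
    VISA_ON_ARRIVAL = "visa_on_arrival"
    ETA = "eta"
    NO_ADMISSION = "no_admission"


@dataclass(frozen=True)
class VisaPair:
    passport: str                  # ISO 3166-1 alpha-3, e.g. "FRA"
    destination: str               # ISO 3166-1 alpha-3, e.g. "JPN"
    requirement: Requirement
    visa_free_days: Optional[int]  # None when no day limit applies
    verified: bool
    source_url: str
    last_updated: datetime


# Placeholder values for illustration only, not real visa data.
pair = VisaPair(
    passport="FRA",
    destination="JPN",
    requirement=Requirement.VISA_FREE,
    visa_free_days=90,
    verified=True,
    source_url="https://example.invalid/source",
    last_updated=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Using an enum for requirement means an unknown value fails at construction time (Requirement("tourist") raises ValueError) instead of leaking into responses.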

199 passports × 199 destinations = 39,601 theoretical pairs. Some self-referential pairs are excluded (you don't need a visa to visit your own country), bringing the actual count to 39,585, which is 16 fewer than the theoretical maximum.

Each pair has a base record in English. Translations are stored separately:

visa_translation {
  passport:     ISO3
  destination:  ISO3
  lang:         enum (15 values)
  description:  text
  documents:    text[]
  process:      text[]
  tips:         text[]
}

39,585 pairs × 15 languages = 593,775 translation records. That's the real scale of the system.
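The translation record and the counts above can be sketched as follows. This is an illustrative Python version under a few assumptions: the (passport, destination, lang) tuple key, the in-memory index, and the "vi" language code are choices made for the example, and the sample description is placeholder text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

TranslationKey = Tuple[str, str, str]  # (passport, destination, lang)


@dataclass
class VisaTranslation:
    passport: str
    destination: str
    lang: str
    description: str
    documents: List[str] = field(default_factory=list)
    process: List[str] = field(default_factory=list)
    tips: List[str] = field(default_factory=list)

    @property
    def key(self) -> TranslationKey:
        return (self.passport, self.destination, self.lang)


# Counts as stated in the article.
THEORETICAL_PAIRS = 199 * 199            # 39,601
PAIRS = 39_585
LANGUAGES = 15
TRANSLATION_RECORDS = PAIRS * LANGUAGES  # 593,775

# Hypothetical in-memory index standing in for the translation table.
index: Dict[TranslationKey, VisaTranslation] = {}
sample = VisaTranslation("VNM", "FRA", "vi", description="(placeholder text)")
index[sample.key] = sample
```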

The translation pipeline

Raw visa data comes from 136 government portals. Most publish in their national language plus English. Some only publish in their national language.

The pipeline has 5 stages:

1. Extraction — Pull structured visa rules from government sources. This is the hardest part. Every government formats their visa information differently. Some have clean REST APIs. Most have PDF documents or HTML pages with inconsistent formatting.

2. Normalization — Map to the 6 standardized requirement types. A government might say "no visa needed for stays under 90 days" — that maps to visa_free with visa_free_days: 90. Another might say "electronic authorization required prior to travel" — that's eta. The mapping isn't always obvious and edge cases are everywhere.

3. English baseline — Generate the English version with all fields: description, documents_required, process, tips, country_info. This is the canonical record that everything else derives from.

4. Translation — Generate the 14 other language versions. This isn't word-for-word translation. Document names, process steps, and tips need to be culturally adapted. "Apply at the embassy" in Japanese includes the Japanese name of the embassy and Japanese-language application forms. "Proof of sufficient funds" in Arabic needs to reflect local banking norms.

5. Verification — Cross-check against at least 2 independent sources per pair. The verified flag indicates whether the data has been confirmed. Unverified pairs are still returned but flagged — better to have data with a caveat than no data at all.
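To make the normalization stage more concrete, here is a small rule-based sketch in Python. It is not the actual pipeline: the phrase patterns are hypothetical, cover only a handful of English wordings, and anything unmatched is assumed to go to manual review.

```python
import re
from typing import Optional, Tuple

# Hypothetical phrase rules. Order matters: the first match wins, so
# "no visa required" is tested before the plain "visa required" rule.
RULES = [
    (re.compile(r"electronic (travel )?authori[sz]ation", re.I), "eta"),
    (re.compile(r"\be-?visa\b|electronic visa", re.I), "e_visa"),
    (re.compile(r"visa on arrival", re.I), "visa_on_arrival"),
    (re.compile(r"no visa (is )?(needed|required)|visa[- ]free", re.I), "visa_free"),
    (re.compile(r"(admission|entry) (is )?(refused|denied)", re.I), "no_admission"),
    (re.compile(r"visa (is )?required", re.I), "visa_required"),
]
DAYS = re.compile(r"(\d+)\s*days?", re.I)


def normalize(text: str) -> Tuple[Optional[str], Optional[int]]:
    """Map a government phrase to (requirement, visa_free_days)."""
    for pattern, requirement in RULES:
        if pattern.search(text):
            days = None
            if requirement == "visa_free":
                match = DAYS.search(text)
                days = int(match.group(1)) if match else None
            return requirement, days
    return None, None  # unmatched: flag for manual review


print(normalize("no visa needed for stays under 90 days"))
print(normalize("electronic authorization required prior to travel"))
```

Real sources are far messier than two regex hits, which is why the edge cases mentioned in step 2 still need human judgment.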

Caching strategy

With 593K+ records and 15 language variants, caching is critical:

Request flow:
Client → API Gateway → Cache (Redis) → Database

Cache key pattern: visa:{passport}:{destination}:{lang}
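A key builder for that pattern might look like the sketch below. The function name, the case normalization, and the default of "en" when no language is given are assumptions for illustration.

```python
def cache_key(passport: str, destination: str, lang: str = "en") -> str:
    """Build a key of the form visa:{passport}:{destination}:{lang}.

    Country codes are uppercased and the language code lowercased so that
    "jpn"/"JPN" and "JA"/"ja" share one cache entry.
    """
    return "visa:{}:{}:{}".format(
        passport.strip().upper(),
        destination.strip().upper(),
        lang.strip().lower(),
    )


print(cache_key("jpn", "FRA", "JA"))  # visa:JPN:FRA:ja
```

Normalizing before building the key avoids caching the same answer several times under differently cased inputs.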

The TTL strategy is split by volatility:

24 hours for stable pairs — visa types don't change hourly. A visa_free pair that's been stable for 3 years doesn't need real-time freshness.
1 hour for recently changed pairs — if Thailand just modified its policy, the cache needs to reflect that quickly.
No cache for the /changes endpoint — it queries the diff table directly. When someone asks "what changed this week?", they need the latest data, not a cached snapshot.
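The TTL split can be sketched as a cache-aside lookup. This Python example is only an approximation of the flow described above: a small in-memory class stands in for Redis so the snippet runs on its own, and the 7-day window that defines "recently changed" is an assumed value, not one from the real system.

```python
import time
from typing import Callable, Dict, Optional, Tuple

STABLE_TTL = 24 * 60 * 60         # 24 hours for stable pairs
RECENT_TTL = 60 * 60              # 1 hour for recently changed pairs
RECENT_WINDOW = 7 * 24 * 60 * 60  # assumption: "recent" means the last 7 days


def ttl_for(last_updated: float, now: float) -> int:
    """Pick a TTL in seconds based on how recently the pair changed."""
    return RECENT_TTL if now - last_updated < RECENT_WINDOW else STABLE_TTL


class TTLCache:
    """In-memory stand-in for Redis, used only to keep the sketch runnable."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str, now: float) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: dict, ttl: int, now: float) -> None:
        self._store[key] = (now + ttl, value)


def get_visa(cache: TTLCache, key: str, load: Callable[[str], dict],
             now: Optional[float] = None) -> dict:
    """Cache-aside read: try the cache, fall back to the database loader."""
    now = time.time() if now is None else now
    cached = cache.get(key, now)
    if cached is not None:
        return cached
    record = load(key)  # hypothetical database lookup
    cache.set(key, record, ttl_for(record["last_updated"], now), now)
    return record
```

An endpoint like /changes would simply skip get_visa and query the database directly.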

The /check endpoint (quick check, no documents) is cached aggressively — it returns 5 fields. The /visa endpoint (full details with documents, process, tips) has shorter TTLs because embassies can update document requirements at any time.

ISO standardization decisions

We use ISO 3166-1 alpha-3 exclusively. Not alpha-2 (FR, JP), not country names (France, Japan), not IATA airport codes (CDG, NRT). Three reasons:

Unambiguous — alpha-3 has no collisions across all 199 countries
Universal — same codes work regardless of language (a Japanese developer sends FRA, not フランス)
Machine-readable — 3 uppercase ASCII characters, trivially validatable with a regex

The API auto-uppercases inputs (fra → FRA) and returns clear error messages for invalid codes:

{
  "error": "\"passport\" value \"JP\" is not a valid ISO 3166-1 alpha-3 code. Did you mean \"JPN\"?"
}

This matters especially for MCP (Model Context Protocol) and agent usage, where the LLM might send lowercase or alpha-2 codes. A clear error message lets the agent self-correct and retry.
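Input handling along these lines could look like the following Python sketch. It is an assumption about the approach, not the API's actual code. A regex only checks the shape of a code, so the sketch also checks membership in a code table. The table here is a tiny illustrative subset, and the function and exception names are invented for the example.

```python
import re

ALPHA3_SHAPE = re.compile(r"[A-Z]{3}")

# Tiny illustrative subset; a real table would cover every supported country.
ALPHA2_TO_ALPHA3 = {"FR": "FRA", "JP": "JPN", "TH": "THA", "VN": "VNM"}
VALID_ALPHA3 = set(ALPHA2_TO_ALPHA3.values())


class InvalidCountryCode(ValueError):
    pass


def parse_country(param: str, value: str) -> str:
    """Uppercase the input and return it if it is a known alpha-3 code."""
    code = value.strip().upper()
    if ALPHA3_SHAPE.fullmatch(code) and code in VALID_ALPHA3:
        return code
    message = f'"{param}" value "{value}" is not a valid ISO 3166-1 alpha-3 code.'
    suggestion = ALPHA2_TO_ALPHA3.get(code)  # offer a fix for alpha-2 input
    if suggestion:
        message += f' Did you mean "{suggestion}"?'
    raise InvalidCountryCode(message)


print(parse_country("passport", "fra"))  # FRA
try:
    parse_country("passport", "JP")
except InvalidCountryCode as exc:
    print(exc)
```

The suggestion in the message is what makes a retry cheap for an agent: the corrected value is right there in the error text.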

What I'd do differently

Start with fewer languages. Launching with 15 simultaneously was ambitious. Starting with 5 (English, Spanish, French, Chinese, Arabic) and adding more based on demand would have been faster, and it would have let us focus our quality work on the highest-impact languages first.

Invest in government source monitoring earlier. The hardest operational challenge isn't translation — it's knowing when a government changes its visa policy. Thailand's 60→30 day rollback happened via a cabinet resolution. That's not an RSS feed. Building automated monitoring for 136 government portals is an ongoing project.

Build the diff system from day one. We added the /changes endpoint later. If we'd built temporal versioning into the data model from the start (valid_from, valid_to on every record), the change detection and audit trail features would have been trivial instead of retrofitted.
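Here is a minimal sketch of what that temporal versioning could look like, in Python. Only the valid_from and valid_to field names come from the text above. Everything else (class name, helper functions, the half-open interval convention) is an assumption, and the country codes and dates in the sample history are placeholders, not real policy data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class VersionedRule:
    passport: str
    destination: str
    visa_free_days: Optional[int]
    valid_from: date           # inclusive
    valid_to: Optional[date]   # exclusive; None means still in force


def as_of(history: List[VersionedRule], day: date) -> Optional[VersionedRule]:
    """Return the version that was in force on the given day, if any."""
    for rule in history:
        if rule.valid_from <= day and (rule.valid_to is None or day < rule.valid_to):
            return rule
    return None


def changes_since(history: List[VersionedRule], since: date) -> List[VersionedRule]:
    """Versions that took effect on or after `since` (a simple change feed)."""
    return [rule for rule in history if rule.valid_from >= since]


# Placeholder history: a 60-day allowance later replaced by a 30-day one.
history = [
    VersionedRule("XXX", "YYY", 60, date(2024, 1, 1), date(2025, 1, 1)),
    VersionedRule("XXX", "YYY", 30, date(2025, 1, 1), None),
]

print(as_of(history, date(2024, 6, 1)).visa_free_days)  # 60
print(len(changes_since(history, date(2024, 12, 1))))   # 1
```

With versions stored this way, a change feed and an audit trail both reduce to date-range queries over the same rows.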

Try the API

The multilingual response in action:

# English (free, no API key)
curl "https://visa.orizn.app/api/v1/visa/check?passport=JPN&destination=FRA"

# Japanese (needs free API key)
curl -H "x-api-key: YOUR_KEY" \
  "https://visa.orizn.app/api/v1/visa?passport=JPN&destination=FRA&lang=ja"

# Arabic
curl -H "x-api-key: YOUR_KEY" \
  "https://visa.orizn.app/api/v1/visa?passport=JPN&destination=FRA&lang=ar"