DEV Community

Amar Kovacevic

Posted on • Originally published at tovi.ae

Why ChatGPT Hallucinates UAE Visa Rules — and How I Built a Localized AI Chat That Doesn't

Ask ChatGPT for the visa requirements to bring your parents to Dubai on a residency. The answer will sound confident. About half of it will be wrong — wrong income threshold, wrong document list, conflated rules from 2019.

Ask it which areas in Dubai are freehold for foreign buyers, or where you can buy alcohol on a Friday in Sharjah, or how the Emirates ID renewal grace period actually works. Same pattern. Confident hallucination on UAE-specific facts.

This is the gap Tovi exists to fill. We built a UAE-focused AI chat — like ChatGPT, but with localized knowledge baked in, in English and Arabic. Here is what I learned about the gap, why generic LLMs fail at this, and what actually moved the needle.

The size of the localization gap

The big foundation models (GPT-4, Claude, Grok, Gemini) are trained primarily on English-language text from the American and European internet. When you ask them about UAE topics, three things go wrong:

  1. Training data is sparse. There's roughly 1/100th as much public English-language text about UAE visa rules as there is about US immigration. The model has seen the topic, but not enough to be confident.
  2. The data it HAS is often outdated. UAE policy moves fast — Golden Visa criteria changed twice in 2023, freelance visa rules in 2024, Emirates ID renewal grace period in 2025. The training cutoff is always behind.
  3. The model fills gaps by extrapolating from other countries' rules. Often plausibly, often wrong. It'll quote "you need three months of bank statements" because that's true in the UK, even though UAE wants six.

The result is a confident-sounding answer that fails the basic test of whether someone reading it could actually act on it.

Why fine-tuning isn't the answer

Early in the project I assumed we'd fine-tune a small open model on UAE government documents, FAQ sites, and immigration forums. Don't do this.

Fine-tuning bakes information into weights. UAE info changes every few months. Re-fine-tuning every quarter is expensive, slow, and produces models that are still confident-wrong on edge cases.

What works better: retrieval-augmented generation (RAG) over a curated, frequently-updated knowledge layer, combined with a strong base model that knows how to say it doesn't know.

The layers we ended up with:

User question (EN or AR)
  → intent classifier (is this UAE-specific? which domain?)
  → vector search over curated UAE knowledge base (visa, RERA, DLD, gov, retail, transport)
  → LLM synthesizes answer using ONLY the retrieved facts as ground truth
  → LLM adds source citations
  → translation layer if user prefers Arabic

The classifier matters. Generic chitchat ("hi how are you", "tell me a joke") should NOT go through the UAE knowledge layer — it should hit the LLM directly with a friendly system prompt. We learned this the hard way after early versions answered "hi" by trying to retrieve UAE facts and producing nonsense.
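The routing above can be sketched in a few lines. Everything here is an illustrative stand-in — the function names, the toy knowledge base, and the example URL are mine, not Tovi's actual code:

```python
# Toy version of the pipeline: classify intent, skip retrieval for chitchat,
# otherwise answer only from retrieved facts and attach their sources.
from dataclasses import dataclass

@dataclass
class Fact:
    text: str
    url: str

# Stand-in knowledge base; a real system queries a vector index per domain.
KB = {
    "visa": [Fact("Parent sponsorship has a minimum-salary requirement.",
                  "https://example.gov.ae/parent-visa")],  # illustrative URL
}

def classify_intent(question: str) -> str:
    # Stand-in for a real intent classifier (a small model or an LLM call).
    return "visa" if "visa" in question.lower() else "chitchat"

def answer(question: str) -> dict:
    domain = classify_intent(question)
    if domain == "chitchat":
        # Greetings and small talk bypass retrieval entirely.
        return {"text": "friendly LLM reply", "sources": []}
    facts = KB.get(domain, [])
    if not facts:
        # Better to admit ignorance than synthesize without ground truth.
        return {"text": "I don't have a verified answer for that yet.", "sources": []}
    # A real synthesis step would prompt the LLM with ONLY these facts.
    return {"text": facts[0].text, "sources": [f.url for f in facts]}
```

The key property is that "hi" never touches the knowledge base, and a domain question never gets answered without retrieved facts behind it.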

The Arabic problem

Arabic isn't optional in the UAE — it's the official language, and Arabic users expect the chat to feel native, not translated.

Three things that don't work:

  • Translating user input to English, processing, translating back. Loses context, especially formal/informal distinctions and gender agreement.
  • Asking the model to "respond in Arabic" via system prompt. Output quality is dramatically worse than English. The model is just less competent in Arabic.
  • Using cheap translation APIs. They mangle proper nouns ("Dubai Marina" becomes a literal translation).

Three things that work:

  • Run the same RAG pipeline in the user's language. Keep an Arabic knowledge base in parallel to the English one, with the same facts.
  • Use Claude or Grok for Arabic, not GPT-4. In our blind testing, Anthropic's and xAI's models produce more natural Arabic than OpenAI's. Counter-intuitive but consistent.
  • Translate into Arabic ONLY if the source content is English and you have no Arabic version. Use DeepL, not Google Translate. Then run a final LLM pass to polish.
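The three rules above reduce to one decision: prefer the parallel Arabic knowledge base, and only fall into the translate-then-polish path when no Arabic version of the fact exists. A sketch, with the KB lookup and step names invented for illustration:

```python
# Decide how an Arabic answer gets produced: native Arabic RAG when the
# parallel KB has the fact, otherwise English facts -> DeepL -> LLM polish.
AR_KB = {"golden visa": "(Arabic fact text from the parallel knowledge base)"}

def arabic_answer_plan(question: str, ar_kb=AR_KB) -> dict:
    hit = ar_kb.get(question.lower())
    if hit is not None:
        # Same RAG pipeline, Arabic KB; Claude/Grok for more natural Arabic.
        return {"pipeline": "native-ar", "text": hit}
    # No Arabic version of this fact: translate the English answer with
    # DeepL, then run a final LLM pass to polish the Arabic.
    return {"pipeline": "en -> deepl -> llm-polish", "text": None}
```

The important part is that translation is the fallback, never the default path.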

The trust loop: showing sources

Generic ChatGPT answers UAE questions without sources. Tovi shows sources for everything UAE-specific.

This sounds obvious, but it's the difference between "I'll try Tovi for this one question" and "I'll keep using Tovi." When the user can click through to the official Dubai DED page or the GDRFA visa portal, they trust the next answer too. When they can't, they're back to verifying everything on Google anyway.

Sources also force you to keep the knowledge base honest. If a source 404s, we know that fact is stale. We have a nightly job that checks every cited URL and flags broken ones for re-curation.
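A minimal version of that nightly job fits in one function using only the standard library. This is my sketch of the idea, not Tovi's actual job, which presumably adds retries, rate limiting, and a review queue:

```python
import urllib.request
import urllib.error

def find_stale_sources(urls, timeout=10):
    """Return cited URLs that no longer resolve, so their facts get re-curated."""
    stale = []
    for url in urls:
        try:
            # HEAD is enough: we only care whether the source still exists.
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                if resp.status >= 400:
                    stale.append(url)
        except (urllib.error.URLError, ValueError, OSError):
            # DNS failures, 404/500s (HTTPError is a URLError subclass),
            # timeouts, malformed URLs — all mean the citation needs review.
            stale.append(url)
    return stale
```

Run it on the full set of cited URLs each night and hand the returned list to whoever curates the knowledge base.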

What's hard, what's still broken

Hard but worth it:

  • Maintaining the UAE knowledge base. Visa rules, RERA permits, DLD fees, free zone licenses — all of these change. We have a part-time human reviewing official sources weekly. This is unsexy but it's the moat.
  • Handling the "I asked in English but want the answer in Arabic" case — common for second-language Arabic speakers.

Still broken:

  • Dialect handling. Tovi understands MSA (Modern Standard Arabic) and Emirati dialect reasonably well. Egyptian, Levantine, and Maghrebi dialect users get worse answers because we don't yet have enough training samples in those dialects.
  • Real-time data. We don't fetch live prayer times, exchange rates, or traffic data. We're working on tool-calling for these, but the latency hit on a chat experience is brutal — every tool call adds ~800ms.
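That ~800ms compounds if tool calls run back to back. One standard mitigation — not something the post claims Tovi has shipped — is to fire independent calls concurrently, so total latency approaches the slowest single call rather than the sum:

```python
import asyncio
import time

async def tool_call(name: str, delay: float = 0.1) -> str:
    # Stand-in for a real fetch (prayer times, FX rates, traffic data).
    await asyncio.sleep(delay)
    return name

async def fetch_all():
    # Two independent tools run concurrently, not sequentially.
    return await asyncio.gather(tool_call("prayer_times"), tool_call("fx_rates"))

start = time.perf_counter()
results = asyncio.run(fetch_all())
elapsed = time.perf_counter() - start  # ~0.1 s total, not ~0.2 s
```

It doesn't make any single call cheaper, but it stops a three-tool answer from costing 2.4 seconds of pure waiting.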
  • Edge cases on government processes. If a user has an unusual case ("I'm a stateless person born in UAE, can I apply for citizenship?"), we point them to a human. The model knows when to defer.

The lesson for vertical AI in general

The temptation when building "AI for X" is to wrap a foundation model in a UI and ship it. For some domains (creative writing, summarization, generic Q&A) that works.

For domains where users will act on the answer — visa decisions, real estate purchases, medical questions, legal advice — the wrapper is the easy part. The hard part is the curated, regularly-updated knowledge layer that prevents the model from confidently lying to people who'll trust it.

The vertical-AI products that win in 2026 will be the ones whose teams treat the knowledge layer as the actual product, and the LLM as a thin interface over it.

If you live in the UAE or want to know how things actually work here — Tovi is free to try. Ask in English or Arabic. If it answers something wrong, tell us — the knowledge base gets better every week.
