<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: iRoom</title>
    <description>The latest articles on DEV Community by iRoom (@iroom).</description>
    <link>https://dev.to/iroom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3922830%2Fb893fa7c-5e7c-45da-bf8a-52734ac94411.png</url>
      <title>DEV Community: iRoom</title>
      <link>https://dev.to/iroom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iroom"/>
    <language>en</language>
    <item>
      <title>How We Built a Sub-200ms Multilingual Chat System Translating 100+ Languages with Our Own LLM</title>
      <dc:creator>iRoom</dc:creator>
      <pubDate>Sun, 10 May 2026 12:15:00 +0000</pubDate>
      <link>https://dev.to/iroom/how-we-built-a-sub-200ms-multilingual-chat-system-translating-100-languages-with-our-own-llm-55d8</link>
      <guid>https://dev.to/iroom/how-we-built-a-sub-200ms-multilingual-chat-system-translating-100-languages-with-our-own-llm-55d8</guid>
      <description>&lt;h1&gt;
  
  
  How We Built a Sub-200ms Multilingual Chat System Translating 100+ Languages with Our Own LLM
&lt;/h1&gt;

&lt;p&gt;A guest from Tokyo checks into a hotel in Istanbul. They want to ask about breakfast. The receptionist speaks Turkish and English. The guest writes in Japanese.&lt;/p&gt;

&lt;p&gt;For decades this meant confused gestures, dictionary apps, and often guests giving up entirely.&lt;/p&gt;

&lt;p&gt;We spent two years building the infrastructure to remove this friction completely. The result runs in production across 700+ hotels in 50+ countries, translating live conversations between guests and staff with median latency under 200 milliseconds. This is how the system works under the hood.&lt;/p&gt;

&lt;h2&gt;Why Generic Translation APIs Failed for Us&lt;/h2&gt;

&lt;p&gt;We started where most teams would: piping messages through commercial translation APIs. Within three months in production we hit three walls that pushed us to build our own model.&lt;/p&gt;

&lt;p&gt;The first wall was tone. Hospitality is a register-sensitive domain. A polite Japanese request like 「お湯をいただけますか」 ("May I have some hot water, please?") came out as a flat imperative in English. The reverse was worse — neutral English requests landed in Japanese, Korean, and German as informal or even rude. Asian languages with explicit honorific systems and Slavic languages with ty/vy distinctions were consistently mistranslated. Guest satisfaction scores in those languages were measurably lower.&lt;/p&gt;

&lt;p&gt;The second wall was domain vocabulary. Hospitality has its own lexicon. &lt;strong&gt;Concierge&lt;/strong&gt;, &lt;strong&gt;do not disturb&lt;/strong&gt;, &lt;strong&gt;continental breakfast&lt;/strong&gt;, &lt;strong&gt;valet parking&lt;/strong&gt;, &lt;strong&gt;early check-in&lt;/strong&gt;, &lt;strong&gt;late checkout&lt;/strong&gt;, &lt;strong&gt;no-show&lt;/strong&gt;, &lt;strong&gt;compendium&lt;/strong&gt;, &lt;strong&gt;turn-down service&lt;/strong&gt; — these have established equivalents across hospitality but generic models translate them literally. A Russian guest asking about "услуга вечерней подготовки номера" should translate to &lt;strong&gt;turn-down service&lt;/strong&gt;. Generic APIs returned things like "evening room preparation service".&lt;/p&gt;

&lt;p&gt;The third wall was operational. Per-character pricing scaled unpredictably as our network grew. Upstream rate limits caused tail latency spikes during peak hours when hotels needed reliability most. We had no control over model updates that could change behavior overnight, sometimes breaking carefully tuned hospitality phrasings without notice.&lt;/p&gt;

&lt;p&gt;By month four we had decided to build our own. By month eight we had a working prototype. Today, two years in, &lt;strong&gt;iRoom LLM&lt;/strong&gt; handles every message that flows through our network.&lt;/p&gt;

&lt;h2&gt;iRoom LLM: A Two-Year Engineering Investment&lt;/h2&gt;

&lt;p&gt;iRoom LLM is a transformer-based translation and reasoning model fine-tuned specifically on hospitality conversational data. It runs entirely on our own infrastructure across multi-region GPU clusters, with no dependency on third-party inference APIs.&lt;/p&gt;

&lt;p&gt;The training corpus took eighteen months to assemble. We built it from four sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A base of multilingual web text filtered for hospitality and travel domain content&lt;/li&gt;
&lt;li&gt;A curated dataset of professionally-translated hotel collateral across 47 languages — welcome letters, in-room directories, service menus, signage, FAQ pages from luxury hotel chains&lt;/li&gt;
&lt;li&gt;Real conversational chat data from our own platform, with guest consent and full anonymization&lt;/li&gt;
&lt;li&gt;A synthetic dataset of hospitality dialogue scenarios we generated and then had native-speaker reviewers correct for naturalness in 30+ target languages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Training was iterative. We started from a 7B-parameter open-weight base, ran continued pretraining on our hospitality corpus, then progressively fine-tuned through supervised instruction-tuning and reinforcement learning from human feedback using hotel staff and native speakers as raters. The model was evaluated weekly against a held-out benchmark of 12,000 real guest-staff exchanges across the 25 most common language pairs in our network, scored on three axes: semantic accuracy, register preservation, and domain vocabulary fidelity.&lt;/p&gt;

&lt;p&gt;The result is a model that consistently outperforms commercial translation APIs on hospitality benchmarks while running 4-6× faster on our own hardware because we control the inference stack end-to-end.&lt;/p&gt;

&lt;h2&gt;The Production Architecture&lt;/h2&gt;

&lt;p&gt;The full message path:&lt;/p&gt;

&lt;p&gt;When a guest scans the QR code in their room, the frontend opens a Progressive Web App. Before the chat interface even renders, three things happen in parallel: the browser language is captured, IP-based country and likely language are inferred from a cached GeoIP database we maintain at the edge, and the room's stored guest profile is queried for any language preference set during check-in. These three signals collapse into a single confirmed &lt;code&gt;guest_locale&lt;/code&gt; that gets locked in for the session.&lt;/p&gt;
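&lt;p&gt;As a minimal sketch, the signal-collapsing step might look like this. The function name, signal ordering, and normalization here are our illustrative assumptions, not iRoom's actual code:&lt;/p&gt;

```python
def resolve_guest_locale(profile_pref, browser_lang, geoip_lang, default="en"):
    """Collapse the three session-start signals into one guest_locale.

    Priority (assumed): explicit check-in preference, then browser
    language, then the GeoIP-based guess, then a default.
    """
    for signal in (profile_pref, browser_lang, geoip_lang):
        if signal:
            # Normalize e.g. "ja-JP" to "ja" so downstream cache keys
            # and routing stay stable across regional variants.
            return signal.split("-")[0].lower()
    return default
```

&lt;p&gt;The priority order (explicit guest preference first, browser hint second, GeoIP guess last) reflects how reliable each signal typically is.&lt;/p&gt;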

&lt;p&gt;When the guest types a message, the client sends the raw UTF-8 text plus the &lt;code&gt;guest_locale&lt;/code&gt; and a session token to our edge nodes over a persistent WebSocket. We never trust client-side language detection — clients can be wrong, and WebViews on certain Android devices misreport language entirely.&lt;/p&gt;

&lt;p&gt;At the edge, the message hits a router that does three things atomically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Writes the raw message to our primary store with the original language tagged&lt;/li&gt;
&lt;li&gt;Publishes the message ID to a translation queue keyed by &lt;code&gt;(target_locale_set)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Acknowledges the client&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Acknowledgement happens within 20-40ms at the p50 because the original-language storage write is the only blocking operation.&lt;/p&gt;
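&lt;p&gt;A minimal sketch of the router's three steps, using in-memory stand-ins for the primary store and the translation queue (all names here are illustrative):&lt;/p&gt;

```python
import uuid

def handle_guest_message(store, queue, text, source_lang, target_locales):
    """Edge-router sketch: persist the original, enqueue translation, ack.

    Only the canonical write blocks the acknowledgement; translation
    happens asynchronously off the queue.
    """
    message_id = str(uuid.uuid4())
    # 1. Canonical write: raw message with its source language tagged.
    store[message_id] = {"text": text, "lang": source_lang}
    # 2. Publish the message ID keyed by the set of target locales.
    queue.append((message_id, frozenset(target_locales)))
    # 3. Acknowledge the client immediately.
    return {"ack": True, "message_id": message_id}
```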

&lt;p&gt;The translation queue is consumed by inference workers running iRoom LLM. Each worker processes messages with the following pipeline:&lt;/p&gt;

&lt;p&gt;Cache lookup against normalized message representation&lt;br&gt;
→ Unicode NFC, casefold, punctuation strip, whitespace collapse&lt;br&gt;
→ Named entities (guest names, room numbers) replaced with typed placeholders&lt;br&gt;
→ Hit: return cached translation in &amp;lt;5ms&lt;br&gt;
→ Miss: continue&lt;br&gt;
Prompt construction&lt;br&gt;
→ Hospitality system instruction&lt;br&gt;
→ Conversation context window (last 3-5 messages)&lt;br&gt;
→ Source language + target language tags&lt;br&gt;
→ Source message&lt;br&gt;
iRoom LLM inference&lt;br&gt;
→ p50: 95ms warm GPU&lt;br&gt;
→ p95: 240ms&lt;br&gt;
Result storage&lt;br&gt;
→ Cache by normalized form&lt;br&gt;
→ Link to original message ID&lt;br&gt;
→ Emit on WebSocket bus&lt;/p&gt;

&lt;p&gt;When staff reads a message, the frontend requests rendering in their &lt;code&gt;staff_locale&lt;/code&gt;. The backend returns a payload with both the original and the translated text. Staff sees the translated version by default, with a tap or hover to reveal the original. Staff replies follow the same pipeline reversed.&lt;/p&gt;

&lt;h2&gt;The Five Engineering Decisions That Made This Work&lt;/h2&gt;

&lt;h3&gt;1. Original-as-source-of-truth, translation-as-derived-data&lt;/h3&gt;

&lt;p&gt;Every message is stored exactly as the sender wrote it, in the source language, with the language explicitly tagged. Translations are derivative artifacts linked by message ID. We never overwrite a stored message with its translation.&lt;/p&gt;

&lt;p&gt;This sounds trivial but pays dividends repeatedly. When iRoom LLM was retrained at version 1.4 and again at 2.0, we re-translated portions of historical conversations to backfill improved quality without touching the canonical message store. When a hotel reported a translation error six months after the fact, we could trace the exact model version that produced it and deploy a corrected mapping. When a property requested an export of all guest conversations for legal compliance, we produced it in source-of-truth form, no language ambiguity.&lt;/p&gt;
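&lt;p&gt;The original-plus-derived split can be sketched as two record types. These dataclasses are illustrative, not our actual schema:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    """Canonical record: exactly what the sender wrote, language tagged."""
    id: str
    text: str
    lang: str

@dataclass(frozen=True)
class Translation:
    """Derived artifact linked by message ID; never overwrites the original."""
    message_id: str
    target_lang: str
    text: str
    model_version: str  # which model produced it, for tracing and backfill
```

&lt;p&gt;Because a &lt;code&gt;Translation&lt;/code&gt; carries its &lt;code&gt;model_version&lt;/code&gt;, backfilling after a retrain is an insert of new derived rows, never an update of the canonical message.&lt;/p&gt;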

&lt;h3&gt;2. Pivot through English, but smarter&lt;/h3&gt;

&lt;p&gt;We pivot non-English ↔ non-English translations through English. Naive direct translation between every pair would require N×(N-1) optimized model paths. With pivoting, we maintain N high-quality paths to and from English.&lt;/p&gt;

&lt;p&gt;The naive concern is that pivoting compounds errors: Turkish → English → Korean should be worse than direct Turkish → Korean. In practice we found the opposite for most pairs because iRoom LLM's training data is much denser for non-English ↔ English pairs than for non-English ↔ non-English pairs. The English representation acts as a high-quality semantic intermediate.&lt;/p&gt;

&lt;p&gt;We do bypass the pivot for a small set of high-volume pairs where we have enough native data to make direct translation reliably better — Japanese ↔ Korean, Spanish ↔ Portuguese, Russian ↔ Ukrainian, Arabic ↔ Farsi. The bypass list is data-driven and adjusted quarterly based on benchmark performance.&lt;/p&gt;
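&lt;p&gt;The routing decision reduces to a small pure function. A sketch, with the bypass list hard-coded from the pairs above (in production the list is data-driven, as noted):&lt;/p&gt;

```python
# Direct-translation pairs with enough native training data to skip
# the English pivot (example values from the article).
DIRECT_PAIRS = {
    frozenset({"ja", "ko"}),
    frozenset({"es", "pt"}),
    frozenset({"ru", "uk"}),
    frozenset({"ar", "fa"}),
}

def translation_path(src, tgt):
    """Return the chain of hops for a source/target language pair."""
    if src == tgt:
        return [src]
    if "en" in (src, tgt) or frozenset({src, tgt}) in DIRECT_PAIRS:
        return [src, tgt]        # direct path, no pivot
    return [src, "en", tgt]      # pivot through English
```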

&lt;h3&gt;3. Aggressive normalized caching&lt;/h3&gt;

&lt;p&gt;Hospitality conversation is more repetitive than developers expect. &lt;em&gt;"What time is breakfast?"&lt;/em&gt; &lt;em&gt;"Is the gym open?"&lt;/em&gt; &lt;em&gt;"Where is the pool?"&lt;/em&gt; &lt;em&gt;"Can I get extra towels?"&lt;/em&gt; These exact phrases — across thousands of variant spellings, capitalizations, and punctuations — flow through the network millions of times per month.&lt;/p&gt;

&lt;p&gt;Our cache layer stores translations keyed by &lt;code&gt;(normalized_text, source_lang, target_lang)&lt;/code&gt;. Normalization runs as a deterministic function: Unicode NFC normalization, casefold, punctuation stripping, whitespace collapse, and named-entity substitution. &lt;em&gt;"Hi!"&lt;/em&gt; &lt;em&gt;"hi"&lt;/em&gt; &lt;em&gt;"HI!!!"&lt;/em&gt; all hit the same entry. &lt;em&gt;"Can I get an extra towel for room 412?"&lt;/em&gt; and &lt;em&gt;"Can I get an extra towel for room 815?"&lt;/em&gt; both normalize to &lt;em&gt;"can i get an extra towel for room ROOM_NUM"&lt;/em&gt; and share a cache slot.&lt;/p&gt;
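&lt;p&gt;A minimal version of that normalization function, with a regex placeholder for room numbers standing in for real named-entity substitution (which also covers guest names in production):&lt;/p&gt;

```python
import re
import unicodedata

ROOM_RE = re.compile(r"\b\d{1,4}\b")  # illustrative stand-in for real NER

def normalize(text):
    """Deterministic cache-key normalization: NFC, casefold, punctuation
    strip, typed placeholder substitution, whitespace collapse."""
    text = unicodedata.normalize("NFC", text).casefold()
    text = re.sub(r"[^\w\s]", "", text)      # strip punctuation
    text = ROOM_RE.sub("ROOM_NUM", text)     # typed placeholder
    return " ".join(text.split())            # collapse whitespace
```

&lt;p&gt;Because the function is deterministic, every replica computes the same cache key for the same input with no coordination.&lt;/p&gt;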

&lt;p&gt;Production hit rate is 47% network-wide, with translation cost cut roughly in half and p50 latency on cache hits at 4ms. The cache is sharded across regions, eventually consistent, with TTLs tuned per language pair based on observed translation drift between iRoom LLM versions.&lt;/p&gt;

&lt;h3&gt;4. Lazy fan-out, never eager translation&lt;/h3&gt;

&lt;p&gt;A conversation might have one guest writing in Japanese and three staff members watching in Turkish, English, and Russian. Naive eager translation generates three translations per guest message. Lazy translation generates one per staff member, only when they actually read.&lt;/p&gt;

&lt;p&gt;Most staff don't read most messages in real time — they read the queue when they sit down at the front desk between guest interactions. Lazy translation cut our compute budget by approximately 70% versus eager generation, with no perceptible latency impact, because translation happens during the read request, which carries a network round trip anyway.&lt;/p&gt;

&lt;p&gt;Implementation-wise, the message bus emits a &lt;code&gt;message-arrived&lt;/code&gt; event that subscribed staff clients receive as a "new message" notification. The client requests the translated rendering only when the staff member focuses the conversation. Cache hit rate on these requests is even higher than on guest writes because the same message is often read by multiple staff members in different languages — second and third reads are nearly always cached.&lt;/p&gt;
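&lt;p&gt;The read path can be sketched as a memoized translate-on-read loop. Here &lt;code&gt;translate&lt;/code&gt; stands in for the iRoom LLM call and the data shapes are illustrative:&lt;/p&gt;

```python
def read_conversation(messages, translations, staff_locale, translate):
    """Render a conversation for one reader, translating lazily.

    `translations` is the per-(message, locale) cache, so a second
    reader in the same language never triggers inference.
    """
    rendered = []
    for msg in messages:
        if msg["lang"] == staff_locale:
            rendered.append(msg["text"])      # already in the reader's language
            continue
        key = (msg["id"], staff_locale)
        if key not in translations:           # translate on first read only
            translations[key] = translate(msg["text"], msg["lang"], staff_locale)
        rendered.append(translations[key])
    return rendered
```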

&lt;h3&gt;5. Domain context as system prompt, conversation history as context window&lt;/h3&gt;

&lt;p&gt;iRoom LLM receives more than just the message to translate. Each inference call constructs a prompt containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The hospitality system instruction (preserve register, use formal honorifics where appropriate, recognize hospitality terminology, avoid literal translation of cultural idioms)&lt;/li&gt;
&lt;li&gt;The last 3-5 messages of conversation context&lt;/li&gt;
&lt;li&gt;Source language and target language tags&lt;/li&gt;
&lt;li&gt;The message itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The conversation context window is the difference between coherent threading and isolated mistranslation. &lt;em&gt;"Yes, please"&lt;/em&gt; translated standalone is ambiguous in many languages. With the previous message visible (&lt;em&gt;"Would you like extra pillows?"&lt;/em&gt;), the translation becomes correct in every target language we support.&lt;/p&gt;

&lt;p&gt;We tuned the context window length empirically. Three messages is the sweet spot — longer windows added latency and noise from older off-topic exchanges; shorter windows lost critical referent information. The window is also truncated from the oldest end when the conversation exceeds 5 turns, retaining only the most recent turns.&lt;/p&gt;
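&lt;p&gt;Prompt assembly with the truncated context window can be sketched as follows. The wire format and instruction wording are assumptions, not our production template:&lt;/p&gt;

```python
SYSTEM_INSTRUCTION = (
    "You are a hospitality translator. Preserve register, use formal "
    "honorifics where appropriate, keep hospitality terminology, and "
    "avoid literal translation of cultural idioms."
)

def build_prompt(history, message, src, tgt, window=3):
    """Assemble an inference prompt with a bounded context window.

    Keeps only the `window` most recent turns, dropping the oldest,
    per the empirically tuned 3-message sweet spot.
    """
    context = history[-window:]  # most recent turns only
    lines = [SYSTEM_INSTRUCTION, f"[{src} -> {tgt}]"]
    lines += [f"context: {turn}" for turn in context]
    lines.append(f"translate: {message}")
    return "\n".join(lines)
```

&lt;p&gt;With the previous turn in &lt;code&gt;context&lt;/code&gt;, an ambiguous &lt;em&gt;"Yes, please"&lt;/em&gt; is disambiguated before the model ever sees it in isolation.&lt;/p&gt;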

&lt;h2&gt;Production Numbers&lt;/h2&gt;

&lt;p&gt;Two years in, here is what the system delivers in steady state across the network:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end p50 latency (send → translated display)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;180ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end p95 latency&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;480ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end p99 latency&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;950ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;iRoom LLM inference p50 (warm GPU)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;iRoom LLM inference p95&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;240ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit rate (network-wide)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;47%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit rate (high-volume properties)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70%+&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Translation cost per active room per month&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.04&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hospitality benchmark score (BLEU-4 + COMET + register)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.91&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System availability (12-month rolling)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;99.97%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For comparison, commercial translation APIs we benchmarked against scored 0.74-0.81 on the same hospitality benchmark.&lt;/p&gt;

&lt;h2&gt;What We Got Wrong on the Way Here&lt;/h2&gt;

&lt;p&gt;Plenty.&lt;/p&gt;

&lt;p&gt;Our first translation cache implementation used unnormalized message text as the key. Hit rate was &lt;strong&gt;8%&lt;/strong&gt;. Adding deterministic normalization and named-entity substitution lifted it to &lt;strong&gt;47%&lt;/strong&gt;. We could have done this in week one but didn't think repetition would matter that much. It mattered enormously.&lt;/p&gt;

&lt;p&gt;We initially built without a conversation context window. iRoom LLM was translating each message in isolation. Quality on multi-turn dialogues was significantly worse than on standalone messages, and we didn't catch it for months because our automated benchmarks tested isolated message pairs. Adding a 3-message context window fixed it; building the benchmark for multi-turn dialogues should have come first.&lt;/p&gt;

&lt;p&gt;We over-engineered the streaming inference path early on. Token-by-token streaming of translations to the client felt like a good idea — staff would see translation appear as it generated. In practice, hotel staff prefer to see the complete translated message at once because partial messages look broken on their dashboards. We removed streaming and saved the implementation complexity.&lt;/p&gt;

&lt;p&gt;Our first multi-region deployment routed all inference to a single primary region. Tail latency for Asia-Pacific hotels was awful. We migrated to region-local inference clusters with model weight replication, which cut p99 latency in half and increased cost only marginally.&lt;/p&gt;

&lt;h2&gt;Where This Runs&lt;/h2&gt;

&lt;p&gt;This entire system powers the chat layer of &lt;a href="https://iroom.help" rel="noopener noreferrer"&gt;iRoom&lt;/a&gt;, a digital guest service platform used by 700+ hotels worldwide. The chat is one feature among many — there's a QR-based service ordering system, a real-time room status dashboard, a multi-channel notification fanout to web, desktop, and Telegram, an AI receptionist that handles routine questions before they reach human staff — but the multilingual chat is the piece that took the most engineering investment and is the one that most clearly removes a problem hotels could not solve before.&lt;/p&gt;

&lt;p&gt;If you're building anything that involves real-time cross-lingual communication at scale, hopefully some of these decisions save you time. We learned most of them by getting them wrong first.&lt;/p&gt;

&lt;p&gt;Happy to discuss any of these tradeoffs in the comments.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>nlp</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
