<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Thiago V.</title>
    <description>The latest articles on DEV Community by Thiago V. (@tverney_77).</description>
    <link>https://dev.to/tverney_77</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852303%2Fc28fe7f4-a933-407f-8c6b-742b24f97742.jpeg</url>
      <title>DEV Community: Thiago V.</title>
      <link>https://dev.to/tverney_77</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tverney_77"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Forgets Everything — Here's a Daemon That Fixes That</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Tue, 07 Apr 2026 16:08:59 +0000</pubDate>
      <link>https://dev.to/tverney_77/your-ai-agent-forgets-everything-heres-a-daemon-that-fixes-that-5ao0</link>
      <guid>https://dev.to/tverney_77/your-ai-agent-forgets-everything-heres-a-daemon-that-fixes-that-5ao0</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted on &lt;a href="https://builder.aws.com/content/3C29ijZpMg6xOxI9Rddl73OMaCX/memory-consolidation-daemon-for-ai-agents-with-bedrock" rel="noopener noreferrer"&gt;AWS Builder&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You spend an hour teaching it your project structure, your coding preferences, the weird Bedrock timeout issue you debugged last Tuesday. Next session? Gone. You're back to explaining that you prefer single quotes and that the CI pipeline needs &lt;code&gt;--run&lt;/code&gt; to avoid watch mode.&lt;/p&gt;

&lt;p&gt;Some frameworks have memory plugins. They work — sort of. But they're coupled to one framework, they accumulate junk over time, and nobody's cleaning up the contradictions from three weeks ago when you changed your mind about the database.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;agent-memory-daemon&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;It's a background daemon that runs alongside your agent — any agent. It watches a directory of session files and does two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extraction&lt;/strong&gt; — scans new session transcripts and pulls out facts, decisions, preferences, and error corrections. Writes each one as a structured markdown file with YAML frontmatter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consolidation&lt;/strong&gt; — periodically reviews the entire memory directory. Merges duplicates, converts relative dates to absolute, removes contradicted facts, prunes stale content, and keeps a concise &lt;code&gt;MEMORY.md&lt;/code&gt; index under a size budget.&lt;/p&gt;

&lt;p&gt;The filesystem is the interface. Your agent writes markdown files to a directory. The daemon reads them, thinks about them, and writes organized memories back. No SDK, no API, no MCP server. If your agent can write a file, it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "aha" moment
&lt;/h2&gt;

&lt;p&gt;I was running an agent that had accumulated 40+ memory files over a few weeks. Half of them were duplicates with slightly different wording. Three of them contradicted each other about which AWS region we were using. The &lt;code&gt;MEMORY.md&lt;/code&gt; index was 800 lines long and the agent was spending half its context window just reading its own memories.&lt;/p&gt;

&lt;p&gt;That's when I realized: agents need a janitor. Not just a place to store memories, but something that actively curates them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Extraction (discovering new memories)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session file modified
        ↓
Cursor check: is this new content?
        ↓
Build prompt: memory manifest + session content
        ↓
LLM identifies facts, decisions, preferences
        ↓
Write structured memory files
        ↓
Advance cursor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The daemon tracks a &lt;code&gt;.extraction-cursor&lt;/code&gt; file — a per-session offset map so it only processes genuinely new content. If a session file gets appended to, it picks up where it left off instead of reprocessing the whole thing.&lt;/p&gt;
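&lt;p&gt;The cursor logic boils down to a per-file offset map. This Python version is a sketch of the idea, not the daemon's actual source; only the cursor-file concept comes from the paragraph above:&lt;/p&gt;

```python
import json
import os

CURSOR_PATH = ".extraction-cursor"  # per-session offset map, as described above

def read_new_content(session_path, cursor_path=CURSOR_PATH):
    """Return only the content appended since the last pass, then advance the cursor."""
    cursor = {}
    if os.path.exists(cursor_path):
        with open(cursor_path) as f:
            cursor = json.load(f)
    offset = cursor.get(session_path, 0)
    with open(session_path) as f:
        f.seek(offset)               # skip already-extracted content
        new_content = f.read()
        cursor[session_path] = f.tell()
    with open(cursor_path, "w") as f:
        json.dump(cursor, f)
    return new_content
```

&lt;p&gt;Appending to a session file and calling this again yields only the appended tail, which is what keeps re-extraction cheap.&lt;/p&gt;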

&lt;h3&gt;
  
  
  Consolidation (organizing existing memories)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Three-gate trigger: time elapsed + session count + lock
        ↓
Four-phase pass: orient → gather → consolidate → prune
        ↓
Merge duplicates, resolve contradictions
        ↓
Update MEMORY.md index (200 lines / 25KB budget)
        ↓
Release lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both modes share a PID-based lock and never run concurrently. Consolidation takes priority — if both triggers fire on the same tick, consolidation runs first.&lt;/p&gt;
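&lt;p&gt;A PID-based lock like the one described can be sketched in a few lines (illustrative Python; the real daemon's lock file name and exact semantics may differ):&lt;/p&gt;

```python
import os

LOCK_PATH = "daemon.lock"  # hypothetical file name for this sketch

def try_acquire(lock_path=LOCK_PATH):
    """Take the lock if it is free or its holder has died; return True on success."""
    if os.path.exists(lock_path):
        pid = int(open(lock_path).read())
        try:
            os.kill(pid, 0)          # signal 0 only probes whether the process exists
            return False             # holder is still alive, back off
        except ProcessLookupError:
            os.remove(lock_path)     # stale lock left behind by a crashed run
    with open(lock_path, "w") as f:
        f.write(str(os.getpid()))
    return True
```

&lt;p&gt;The stale-lock check matters: without it, one crashed consolidation pass would block extraction forever.&lt;/p&gt;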

&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx agent-memory-daemon init    &lt;span class="c"&gt;# generates memconsolidate.toml&lt;/span&gt;
npx agent-memory-daemon start   &lt;span class="c"&gt;# starts the daemon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The config is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;memory_directory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./memory"&lt;/span&gt;
&lt;span class="py"&gt;session_directory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./sessions"&lt;/span&gt;

&lt;span class="py"&gt;extraction_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;extraction_interval_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;

&lt;span class="nn"&gt;[llm_backend]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bedrock"&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"us-east-1"&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"us.anthropic.claude-sonnet-4-20250514-v1:0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use OpenAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[llm_backend]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${OPENAI_API_KEY}"&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What a memory file looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bedrock&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;configuration"&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Default&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SDK&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;too&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;short&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;large&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prompts"&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reference&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
The AWS SDK's default request timeout causes ECONNABORTED errors
on prompts over 30K characters. Set requestTimeout to 300000 (5 min)
via NodeHttpHandler when using BedrockRuntimeClient.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each file has a type: &lt;code&gt;user&lt;/code&gt; (preferences), &lt;code&gt;feedback&lt;/code&gt; (lessons learned), &lt;code&gt;project&lt;/code&gt; (architecture decisions), or &lt;code&gt;reference&lt;/code&gt; (technical facts). The daemon classifies them automatically during extraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework-agnostic by design
&lt;/h2&gt;

&lt;p&gt;The integration pattern is the same regardless of what you're building with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strands / LangChain&lt;/strong&gt;: after each agent run, dump a session summary to the sessions directory. At startup, read &lt;code&gt;MEMORY.md&lt;/code&gt; into the system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt;: point &lt;code&gt;session_directory&lt;/code&gt; at your workspace's transcript directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom agents&lt;/strong&gt;: same pattern — write files, read the index.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No plugin system, no adapter layer. The filesystem is the API.&lt;/p&gt;
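&lt;p&gt;For a custom agent, the whole integration can be two helper functions. This is a hedged sketch, not project code; the paths mirror the config defaults shown earlier:&lt;/p&gt;

```python
import os
import time

MEMORY_INDEX = "memory/MEMORY.md"   # maintained by the daemon
SESSION_DIR = "sessions"            # watched by the daemon

def load_memory_prompt(index_path=MEMORY_INDEX):
    """At startup, prepend the consolidated memory index to the system prompt."""
    if not os.path.exists(index_path):
        return ""
    return "Known context from previous sessions:\n" + open(index_path).read()

def dump_session(transcript, session_dir=SESSION_DIR):
    """After a run, append the transcript where the daemon will pick it up."""
    os.makedirs(session_dir, exist_ok=True)
    path = os.path.join(session_dir, time.strftime("%Y-%m-%d") + ".md")
    with open(path, "a") as f:
        f.write(transcript + "\n")
    return path
```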

&lt;h2&gt;
  
  
  Guardrails
&lt;/h2&gt;

&lt;p&gt;One thing I learned the hard way: without limits, extraction turns into a runaway feedback loop. Each pass sees the new files from the last pass, prompts the LLM with a bigger manifest, and the LLM creates even more files.&lt;/p&gt;

&lt;p&gt;So there are guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_memory_files&lt;/code&gt; — hard cap on total files in the directory (default: 50)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_files_per_batch&lt;/code&gt; — cap on creates per extraction pass (default: 10)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_prompt_chars&lt;/code&gt; — budget enforcement with progressive truncation&lt;/li&gt;
&lt;li&gt;Per-session cursor — prevents reprocessing already-extracted content&lt;/li&gt;
&lt;/ul&gt;
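&lt;p&gt;The first two caps compose into a simple clamp. A sketch of the idea (the function name and shape are mine; the defaults match the list above):&lt;/p&gt;

```python
def plan_creates(existing_count, requested, max_memory_files=50, max_files_per_batch=10):
    """Clamp the LLM's requested file creations to the configured budgets.

    Returns (applied, skipped): the creates to perform and the ones dropped.
    """
    room = max(0, max_memory_files - existing_count)      # hard cap on the directory
    allowed = min(len(requested), max_files_per_batch, room)
    return requested[:allowed], requested[allowed:]
```

&lt;p&gt;With 48 files already on disk, a batch of 5 requested creates gets clamped to 2; a batch of 12 against an empty directory gets clamped to the per-batch cap of 10.&lt;/p&gt;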

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Every operation emits structured JSON logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-04-07T14:23:01.234Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"extraction:complete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"created"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"updated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"durationMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4521&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"promptLength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;39102&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"operationsRequested"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"operationsApplied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"operationsSkipped"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get duration, prompt size, operation counts, and skip reasons. Pipe it to CloudWatch, Datadog, or just &lt;code&gt;jq&lt;/code&gt;.&lt;/p&gt;
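&lt;p&gt;Line-delimited JSON also means a consumer can be a few lines of any scripting language. A Python sketch (field names taken from the sample event above):&lt;/p&gt;

```python
import json

def summarize(log_lines):
    """Tally extraction results from the daemon's JSON log stream."""
    totals = {"created": 0, "updated": 0, "durationMs": 0}
    for line in log_lines:
        event = json.loads(line)
        if event.get("event") == "extraction:complete":
            for key in totals:
                totals[key] += event["data"].get(key, 0)
    return totals
```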

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vector similarity search for memory recall (right now it's manifest-based)&lt;/li&gt;
&lt;li&gt;Multi-agent support (shared memory directories with conflict resolution)&lt;/li&gt;
&lt;li&gt;A web UI for browsing and editing memories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project is MIT-licensed and on &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Issues, PRs, and feedback are welcome.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;agent-memory-daemon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your agent keeps forgetting things, give it a daemon with a good memory.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>typescript</category>
    </item>
    <item>
      <title>The Irony of Language Models That Don't Speak Your Language</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Mon, 30 Mar 2026 20:53:42 +0000</pubDate>
      <link>https://dev.to/tverney_77/the-irony-of-language-models-that-dont-speak-your-language-5b5</link>
      <guid>https://dev.to/tverney_77/the-irony-of-language-models-that-dont-speak-your-language-5b5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This is a personal project and article. The opinions expressed here are my own and do not reflect the opinions of AWS or Amazon. This project is not an AWS product and is not endorsed by or affiliated with AWS.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI is plugged into everything now, and it is genuinely breakthrough technology.&lt;/p&gt;

&lt;p&gt;However, there is an elephant in the room that rarely makes the headlines: LLMs are fundamentally centered on high-resource languages, and English most of all.&lt;/p&gt;

&lt;p&gt;The only publicly disclosed training data breakdown — GPT-3 (Brown et al., 2020) — showed over 92% English tokens. Newer models don't publish exact ratios, but the picture has evolved: Llama 3 remains heavily English, GPT-4o highlights improved multilingual performance as a key feature, and models like Qwen and Aya have invested significantly more in non-English data. The gap is narrowing for high-resource languages, but for the thousands of low-resource languages, the structural imbalance remains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0kx394w7v2kwjxajsg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0kx394w7v2kwjxajsg3.png" alt=" " width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The remaining languages — spoken by billions of people — are either poorly represented through low-quality machine-translated English content, or absent entirely.&lt;/p&gt;

&lt;p&gt;This means that when a Thai farmer asks about crop subsidies, when a Nigerian mother searches for vaccination schedules in Yoruba, or when a Brazilian citizen navigates tax forms in Portuguese, the AI they're interacting with is operating at a fraction of its true capability.&lt;/p&gt;

&lt;p&gt;Not because the intelligence isn't there, but because the model was never properly taught to listen in their language. The industry celebrates "human-level performance" on benchmarks, but those benchmarks are overwhelmingly English. For most of the world, the AI revolution hasn't arrived yet — it's still stuck at customs, waiting for a translator.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ancient Myth
&lt;/h2&gt;

&lt;p&gt;Around 4,000 years ago, Babylon was the most cosmopolitan city on Earth.&lt;/p&gt;

&lt;p&gt;Situated at the crossroads of ancient trade routes in modern-day Iraq, it was a place where Akkadian, Sumerian, Aramaic, Elamite, and dozens of other languages collided daily. Merchants, scholars, and diplomats from across Mesopotamia converged there, and the city thrived precisely because it found ways to bridge those languages — through scribes, translators, and the world's first multilingual libraries.&lt;/p&gt;

&lt;p&gt;The biblical story of the Tower of Babel, set in Babylon, tells it differently: God scattered humanity across the earth and confused their languages so they could no longer understand each other. It's a story about the fracturing of communication — the moment when a shared project became impossible because people could no longer speak the same language.&lt;/p&gt;

&lt;p&gt;We're living in a strange echo of that story. We've built the most powerful reasoning machines in human history — LLMs that can write poetry, prove theorems, and generate working code. But these machines think in English. When the rest of the world tries to speak to them, the tower crumbles. Not because the intelligence isn't there, but because the language barrier corrupts the signal before it reaches the model's reasoning core.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Illusion of Multilingual AI
&lt;/h2&gt;

&lt;p&gt;Ask any frontier LLM a question in English, and you'll get a polished, accurate, well-reasoned response. Now ask the same question in Thai. Or Amharic. Or even Portuguese.&lt;/p&gt;

&lt;p&gt;Suddenly, the magic fades.&lt;/p&gt;

&lt;p&gt;The response might be shorter, vaguer, or riddled with English fragments leaking through. In some cases, it's outright gibberish. And here's the part nobody talks about: you're paying more for that worse response.&lt;/p&gt;

&lt;p&gt;While the industry celebrates benchmark after benchmark showing LLMs reaching "human-level performance," there's a massive asterisk: in English. For the 6,950 other languages spoken on this planet, AI remains degraded, more expensive, and in some cases outright unreliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Don't Lie
&lt;/h2&gt;

&lt;p&gt;Most leading LLMs allocate approximately 92% of their training tokens to English (&lt;a href="https://arxiv.org/abs/2005.14165" rel="noopener noreferrer"&gt;Brown et al., "Language Models are Few-Shot Learners", NeurIPS 2020&lt;/a&gt;). Of the approximately 7,000 spoken languages globally, most models only cover about 50 high-resource ones (&lt;a href="https://www.frontiersin.org/research-topics/77716/language-models-for-low-resource-languages" rel="noopener noreferrer"&gt;Frontiers Research Topic: Language Models for Low-Resource Languages&lt;/a&gt;). The remaining languages lack both the digital data and quality resources to benefit from recent AI advancements — creating barriers to education, healthcare, financial access, and employment for the communities that speak them.&lt;/p&gt;

&lt;p&gt;But the problem goes deeper than just quality. It's about money.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Language Tax
&lt;/h2&gt;

&lt;p&gt;LLMs use tokenizers to break text into chunks before processing. These tokenizers were designed primarily for English. When you feed them Thai, Japanese, Arabic, or Korean text, the same semantic content gets split into 2-4x more tokens.&lt;/p&gt;

&lt;p&gt;I built a proxy called &lt;a href="https://github.com/tverney/llm-proxy-babylon" rel="noopener noreferrer"&gt;LLM Proxy Babylon&lt;/a&gt; to measure this. Here's what I found with a real Thai prompt about sorting algorithms:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Direct Thai&lt;/th&gt;
&lt;th&gt;Optimized (English)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt tokens&lt;/td&gt;
&lt;td&gt;~166&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token savings&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;70% fewer input tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality score&lt;/td&gt;
&lt;td&gt;0.456&lt;/td&gt;
&lt;td&gt;~0.949 (English-level)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's 3.4x fewer input tokens and 2x better quality. At Amazon Nova Lite pricing on Bedrock ($0.06/1M input tokens), sending 1,000 Thai prompts of this size would cost ~$0.01 directly vs ~$0.003 through the optimizer — and the optimized path delivers dramatically better responses.&lt;/p&gt;

&lt;p&gt;The savings scale dramatically with premium models. At Claude Opus 4 pricing on Bedrock ($15/1M input tokens), the same 1,000 Thai prompts would cost $2.49 directly vs $0.74 through the optimizer — a $1.75 saving per thousand requests on input tokens alone, with better quality on every response.&lt;/p&gt;
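&lt;p&gt;For anyone who wants to check the arithmetic, here it is per 1,000 prompts, using the token counts from the benchmark table and the Bedrock list prices quoted:&lt;/p&gt;

```python
def batch_cost(tokens_per_prompt, prompts, price_per_million_tokens):
    """Input-token cost in dollars for a batch of same-size prompts."""
    return tokens_per_prompt * prompts * price_per_million_tokens / 1_000_000

# Token counts from the Thai benchmark above; $15/1M is Claude Opus 4 input pricing
direct = batch_cost(166, 1_000, 15.0)     # direct Thai prompts: $2.49
optimized = batch_cost(49, 1_000, 15.0)   # translated-to-English prompts: ~$0.74
saving = direct - optimized               # roughly $1.75 per thousand requests
```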

&lt;p&gt;Every company running a multilingual chatbot is silently paying this tax. Their English-speaking users get fast, cheap, high-quality responses. Their Thai-speaking users get slower, more expensive, lower-quality responses — for the same product, same subscription price.&lt;/p&gt;

&lt;p&gt;And it compounds. Chatbots send the full conversation history with every request. A 10-message conversation in Thai accumulates tokens 3x faster than the same conversation in English. By turn 10, you're sending massive context windows that cost a fortune and may even overflow the model's limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Government Chatbots Can't Serve Their Own Citizens
&lt;/h2&gt;

&lt;p&gt;Now imagine this problem at the scale of a government.&lt;/p&gt;

&lt;p&gt;Countries across Southeast Asia, Africa, the Middle East, and South America are deploying AI-powered chatbots to help citizens access healthcare information, navigate tax systems, apply for social programs, and find emergency services. These are critical services that directly impact people's lives.&lt;/p&gt;

&lt;p&gt;But here's the catch: the LLMs powering these chatbots were trained on English. When a farmer in rural Thailand asks about crop subsidies in Thai, the model's reasoning capability drops by nearly 50%. When a mother in Nigeria asks about childhood vaccination schedules in Yoruba, the model might not even understand the question properly.&lt;/p&gt;

&lt;p&gt;The irony is painful: governments invest in AI to serve their citizens better, but the AI itself delivers unequal quality across languages. Not intentionally — but structurally, through training data imbalance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Safety Gap Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;It gets worse. Research shows that low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages — and in intentional attack scenarios, unsafe output rates can reach over 80% (&lt;a href="https://arxiv.org/abs/2310.06474" rel="noopener noreferrer"&gt;Deng et al., "Multilingual Jailbreak Challenges in Large Language Models", 2023&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;LLM safety guardrails — the filters that prevent models from generating harmful content — were primarily trained on English data.&lt;/p&gt;

&lt;p&gt;This means a prompt injection attack that would be caught instantly in English can sail right through in Amharic or Lao. The model simply doesn't recognize the harmful intent in languages it barely understands.&lt;/p&gt;

&lt;p&gt;For any organization deploying AI in production — especially in healthcare, finance, or government — this isn't just a quality issue. It's a liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Different Approach: Don't Fix the Model, Route Around It
&lt;/h2&gt;

&lt;p&gt;The conventional wisdom says: "Just train better multilingual models." And yes, that's happening. But it's slow, expensive, and may never fully close the gap for the thousands of low-resource languages that lack sufficient training data.&lt;/p&gt;

&lt;p&gt;What if we could get English-level quality from any language, today, without retraining a single model?&lt;/p&gt;

&lt;p&gt;That's the idea behind &lt;a href="https://github.com/tverney/llm-proxy-babylon" rel="noopener noreferrer"&gt;LLM Proxy Babylon&lt;/a&gt; — an open-source proxy I built that sits between your application and any LLM API.&lt;/p&gt;

&lt;p&gt;It detects the input language, decides whether translating to English would improve results, and if so, translates the prompt before sending it to the model. Then it appends a simple instruction: "Please respond in Thai since the original question was asked in Thai."&lt;/p&gt;

&lt;p&gt;LLM Proxy Babylon is named for the city, not the curse. It's an attempt to do what ancient Babylon did: sit at the crossroads of languages and make sure everyone gets understood.&lt;/p&gt;

&lt;p&gt;The key insight: for most languages, LLMs have little difficulty generating output in a specified language. The real gap is in understanding non-English prompts. So the proxy translates the input (where the model is weak) and lets the model answer in the user's language (where it is usually strong), with optional output translation for low-resource languages where generation itself is also lossy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Results
&lt;/h2&gt;

&lt;p&gt;I tested this with Mistral 7B on a Thai prompt about bubble sort complexity. The results were dramatic:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without the optimizer (direct Thai):&lt;/strong&gt; The model produced garbled output mixing English fragments into Thai text ("วงจirkle", "sorteering technique"), with confused, repetitive reasoning. 1,749 tokens of mostly noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With the optimizer (translated to English first):&lt;/strong&gt; The same model produced a clean, structured response correctly explaining O(n²) vs O(n log n) complexity, listing Merge Sort, Quick Sort, and Heap Sort with accurate Big-O analysis — all delivered back in Thai. 1,446 tokens of useful content.&lt;/p&gt;

&lt;p&gt;The model's reasoning capability was there all along. It just couldn't access it through Thai input.&lt;/p&gt;

&lt;p&gt;I also benchmarked Amazon Nova Lite across multiple languages using the built-in evaluation harness:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Quality Score&lt;/th&gt;
&lt;th&gt;Delta from English&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;English (baseline)&lt;/td&gt;
&lt;td&gt;0.949&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Portuguese&lt;/td&gt;
&lt;td&gt;0.763&lt;/td&gt;
&lt;td&gt;-0.19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Korean&lt;/td&gt;
&lt;td&gt;0.663&lt;/td&gt;
&lt;td&gt;-0.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Japanese&lt;/td&gt;
&lt;td&gt;0.595&lt;/td&gt;
&lt;td&gt;-0.35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Thai&lt;/td&gt;
&lt;td&gt;0.456&lt;/td&gt;
&lt;td&gt;-0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern maps exactly to language resource availability. Portuguese (high-resource) takes the smallest hit. Thai (low-resource) loses nearly half the quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The proxy exposes an OpenAI-compatible API, so it works as a drop-in replacement with any framework — LangChain, Strands Agents, or any OpenAI SDK client. Just change the base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIModel&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;อธิบายแนวคิดของ recursion ในการเขียนโปรแกรม&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, each request flows through a pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Detect&lt;/strong&gt; the language (using franc for BCP-47 identification)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parse&lt;/strong&gt; mixed content (preserve code blocks, URLs, JSON — only translate natural language)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classify&lt;/strong&gt; the task type (reasoning, math, code-generation, culturally-specific)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route&lt;/strong&gt; — decide whether to translate, skip, or use hybrid mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translate&lt;/strong&gt; the prompt to English (if beneficial)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inject&lt;/strong&gt; a language instruction ("Please respond in Thai...")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forward&lt;/strong&gt; to the LLM (supports AWS Bedrock and OpenAI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return&lt;/strong&gt; the response to the client&lt;/li&gt;
&lt;/ol&gt;
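&lt;p&gt;The pipeline above can be sketched as a small routing function. This is a toy illustration; the function names and the cultural-marker heuristic are hypothetical stand-ins, not the proxy's actual code:&lt;/p&gt;

```python
# Toy sketch of the detect -> classify -> route steps described above.
# detect_language and CULTURAL_MARKERS are illustrative assumptions,
# not the proxy's real implementation (which uses franc for detection).

CULTURAL_MARKERS = {"tonight", "near me", "in paris"}  # toy heuristic

def detect_language(prompt: str) -> str:
    """Stand-in for a real detector: ASCII-only text counts as English."""
    return "en" if prompt.isascii() else "th"

def route(prompt: str) -> str:
    """Decide whether to translate the prompt before forwarding to the LLM."""
    if detect_language(prompt) == "en":
        return "skip"        # already English: nothing to gain
    if any(m in prompt.lower() for m in CULTURAL_MARKERS):
        return "skip"        # cultural context matters more than English reasoning
    return "translate"       # expect a quality win from English routing

print(route("Explain recursion"))       # skip
print(route("อธิบาย recursion"))         # translate
```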

&lt;p&gt;The routing engine is smart about when NOT to translate. Culturally-specific questions ("What's good tonight in Paris?") skip translation because the model needs cultural context, not English reasoning. English prompts skip entirely. The system only translates when it expects a quality improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Built on AWS
&lt;/h3&gt;

&lt;p&gt;The proxy supports AWS Bedrock natively via the Converse API. Authentication is handled automatically through the AWS SDK — no API keys needed in requests. I tested with Amazon Nova Lite and Mistral 7B, both available on Bedrock.&lt;/p&gt;

&lt;p&gt;For translation, it supports Amazon Translate ($15/1M characters, high quality for proper nouns and technical content) and LibreTranslate (self-hosted, free) out of the box, with a pluggable interface for DeepL or Google Translate. Just set &lt;code&gt;TRANSLATOR_BACKEND=amazon-translate&lt;/code&gt; to switch — uses your existing AWS credentials.&lt;/p&gt;
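&lt;p&gt;A sketch of what that pluggable interface might look like; the class and method names here are assumptions for illustration, not the proxy's real API:&lt;/p&gt;

```python
# Hypothetical pluggable translator backend, selected by the
# TRANSLATOR_BACKEND environment variable as described above.
import os
from abc import ABC, abstractmethod

class Translator(ABC):
    @abstractmethod
    def translate(self, text: str, source: str, target: str) -> str: ...

class AmazonTranslate(Translator):
    def translate(self, text, source, target):
        # A real implementation would call the boto3 Translate client here,
        # reusing the same AWS credentials as the Bedrock calls.
        raise NotImplementedError

class LibreTranslate(Translator):
    def translate(self, text, source, target):
        # A real implementation would POST to a self-hosted LibreTranslate server.
        raise NotImplementedError

BACKENDS = {"amazon-translate": AmazonTranslate, "libretranslate": LibreTranslate}

def make_translator() -> Translator:
    """Pick a backend from the environment, defaulting to the free option."""
    name = os.environ.get("TRANSLATOR_BACKEND", "libretranslate")
    return BACKENDS[name]()
```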

&lt;h2&gt;
  
  
  The Conversation Cache: Solving the Multi-Turn Problem
&lt;/h2&gt;

&lt;p&gt;Multi-turn conversations are where the token tax really hurts. Every request sends the full history, and that history is in the user's language — eating 2-4x more tokens per turn.&lt;/p&gt;

&lt;p&gt;The proxy includes a conversation translation cache. Pass an &lt;code&gt;X-Conversation-Id&lt;/code&gt; header and previously translated messages are pulled from cache instead of being re-translated. By turn 10, you get 9 cache hits and only 1 miss per request — 9 translation API calls saved, and the LLM always sees a lean English context window.&lt;/p&gt;
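&lt;p&gt;A minimal version of such a cache might look like the sketch below. The class and its bookkeeping are illustrative assumptions, but the simulation reproduces the turn-10 behavior: 9 hits, 1 miss:&lt;/p&gt;

```python
# Illustrative conversation translation cache, keyed by conversation id
# plus a hash of each message. Not the proxy's actual internals.
import hashlib

class ConversationCache:
    def __init__(self):
        self._store = {}   # (conversation_id, message_hash) -> translation
        self.hits = 0
        self.misses = 0

    def translate(self, conversation_id, message, translate_fn):
        key = (conversation_id, hashlib.sha256(message.encode()).hexdigest())
        if key in self._store:
            self.hits += 1                 # seen before: skip the API call
        else:
            self.misses += 1
            self._store[key] = translate_fn(message)
        return self._store[key]

cache = ConversationCache()
fake_translate = str.upper                  # stand-in for a real translation call

messages = [f"turn {i}" for i in range(1, 11)]
for n in range(1, 11):                      # each request resends the full history
    hits_before = cache.hits
    for msg in messages[:n]:
        cache.translate("conv-1", msg, fake_translate)
final_request_hits = cache.hits - hits_before

print(final_request_hits, cache.misses)     # 9 hits on the 10th request, 10 misses total
```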

&lt;h2&gt;
  
  
  Beyond Quality: Safety as a Side Effect
&lt;/h2&gt;

&lt;p&gt;By translating low-resource language prompts to English before sending them to the LLM, the optimizer routes every prompt through the model's strongest safety alignment. A harmful prompt in Thai or Amharic gets evaluated by English-trained guardrails operating at full strength, rather than the weaker low-resource language alignment.&lt;/p&gt;

&lt;p&gt;This isn't a complete safety solution — but for the common case, it significantly narrows the 3x safety gap between high-resource and low-resource languages identified by Deng et al.&lt;/p&gt;

&lt;h2&gt;
  
  
  But What If Models Get Better at Multilingual?
&lt;/h2&gt;

&lt;p&gt;They will — and the optimizer is designed for that.&lt;/p&gt;

&lt;p&gt;The token cost problem is structural, not a training problem. BPE tokenizers, trained on English-heavy corpora, split Thai, Arabic, and Korean into 2-4x more tokens than semantically equivalent English. Unless providers fundamentally redesign their tokenizers and retrain everything, the cost disparity persists regardless of how multilingual the model becomes.&lt;/p&gt;

&lt;p&gt;Conversation history compounding doesn't go away either. Even a perfectly multilingual model still charges per token. A 10-turn Thai conversation still accumulates tokens 3x faster than English. The conversation translation cache solves this at the infrastructure level.&lt;/p&gt;
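&lt;p&gt;The compounding is easy to see with back-of-the-envelope arithmetic, assuming a 3x token multiplier for Thai and 100 English-equivalent tokens per turn (illustrative numbers, not benchmark data):&lt;/p&gt;

```python
# Illustrative cost arithmetic: each request resends the whole history,
# so request n is billed for n turns' worth of tokens.

def cumulative_tokens(turns, tokens_per_turn, multiplier):
    """Total tokens billed across all requests in a conversation."""
    total = 0
    for n in range(1, turns + 1):
        total += n * tokens_per_turn * multiplier   # request n carries n turns
    return total

english = cumulative_tokens(10, 100, 1)   # 5,500 tokens over 10 turns
thai = cumulative_tokens(10, 100, 3)      # 16,500 tokens for the same chat
print(thai / english)                     # the 3x penalty compounds straight into cost
```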

&lt;p&gt;RAG retrieval is an embedding problem, not an LLM problem. Most embedding models are trained predominantly on English text, so non-English queries retrieve less relevant documents. Translating queries to English before retrieval improves recall regardless of how good the LLM itself is at understanding Thai.&lt;/p&gt;

&lt;p&gt;Fine-tuning ROI is permanent. Companies fine-tune on English domain data. A perfectly multilingual base model still won't have that domain-specific knowledge accessible through non-English prompts unless the fine-tuning data was also multilingual — which it almost never is.&lt;/p&gt;

&lt;p&gt;Safety alignment will always lag for low-resource languages. Even as models improve, safety training data will remain English-heavy. Routing through English for safety filtering is a defense-in-depth strategy that stays relevant.&lt;/p&gt;

&lt;p&gt;And the adaptive router handles the transition gracefully. As models get better at specific languages, the shadow evaluator detects that translation no longer helps, and the router automatically switches to skip. The proxy doesn't fight against model improvements — it adapts to them. For a language where the model reaches English parity, the proxy becomes a transparent pass-through with zero overhead.&lt;/p&gt;

&lt;p&gt;Today the proxy is primarily about quality. As models improve, it becomes primarily about cost optimization, safety, and RAG. The architecture already supports that transition because routing decisions are data-driven, not hardcoded assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is an open-source project and there's a lot more to explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG improvement&lt;/strong&gt; — translate queries to English before vector retrieval for better recall (current architecture supports it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning ROI&lt;/strong&gt; — ensure non-English users benefit from English-only fine-tuned models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dialect detection&lt;/strong&gt; — handle Egyptian Arabic vs Modern Standard Arabic, European vs Brazilian Portuguese&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Question We Should Be Asking
&lt;/h2&gt;

&lt;p&gt;The next time someone says "all LLMs are the same," ask them: &lt;strong&gt;in which language?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI won't truly be intelligent until it understands every language, every culture. Until then, tools like LLM Proxy Babylon can bridge the gap — giving every user English-level quality, regardless of what language they think in.&lt;/p&gt;

&lt;p&gt;The code is open source: &lt;a href="https://github.com/tverney/llm-proxy-babylon" rel="noopener noreferrer"&gt;github.com/tverney/llm-proxy-babylon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;273 property-based tests. Real benchmarks. Ready to deploy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://builder.aws.com/content/3BfRX8ILgQnT0aO1vmYWvgVCHKT/the-irony-of-language-models-that-dont-speak-your-language" rel="noopener noreferrer"&gt;AWS Builder Center&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>i18n</category>
      <category>llm</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
