<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Phasu  Yeneng</title>
    <description>The latest articles on DEV Community by Phasu  Yeneng (@kmusicman).</description>
    <link>https://dev.to/kmusicman</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1381246%2Fe58a0893-0bc2-4a5d-9845-3bbe41076adf.jpeg</url>
      <title>DEV Community: Phasu  Yeneng</title>
      <link>https://dev.to/kmusicman</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kmusicman"/>
    <language>en</language>
    <item>
      <title>Stop Your OpenAI Bill from Exploding: Per-User LLM Budget Caps in Node.js</title>
      <dc:creator>Phasu  Yeneng</dc:creator>
      <pubDate>Mon, 04 May 2026 14:44:32 +0000</pubDate>
      <link>https://dev.to/kmusicman/stop-your-openai-bill-from-exploding-per-user-llm-budget-caps-in-nodejs-48c8</link>
      <guid>https://dev.to/kmusicman/stop-your-openai-bill-from-exploding-per-user-llm-budget-caps-in-nodejs-48c8</guid>
      <description>&lt;h2&gt;
  
  
  The cost incident that started this
&lt;/h2&gt;

&lt;p&gt;Three weeks after we put our chatbot into production, I opened the OpenAI billing dashboard on a Monday morning and stopped breathing for a second. One session — not one user, one &lt;em&gt;session&lt;/em&gt; — had burned through roughly four times the daily budget for the entire app. Over a single afternoon.&lt;/p&gt;

&lt;p&gt;The session wasn't malicious. It was a test account someone forgot to log out of, hammering the chat endpoint in the background while reloading a broken page. No rate limit was breached. No alarm fired. No infrastructure metric looked unusual. The only place it showed up was the bill at the end of the month.&lt;/p&gt;

&lt;p&gt;That was the day I learned that &lt;strong&gt;rate limits and budget limits are not the same thing&lt;/strong&gt;, and that running an LLM-powered app without a per-user cost cap is roughly the same as putting a credit card behind a public form and hoping nobody fills it in 800 times.&lt;/p&gt;

&lt;p&gt;This post walks through the pattern I now use in every Node.js + Express app that talks to OpenAI: &lt;strong&gt;track first, then cap, then degrade gracefully, then cache aggressively.&lt;/strong&gt; It's framework-agnostic, Postgres-backed, and a fresh team can ship it in a single afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "rate limit" is the wrong abstraction
&lt;/h2&gt;

&lt;p&gt;Classic rate limiting nudges you toward thinking in &lt;em&gt;requests per minute&lt;/em&gt;. That model works fine for a REST API where every request costs roughly the same. It falls apart for LLM APIs because &lt;strong&gt;request count is decoupled from cost&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Compare two requests to the same endpoint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A 50-token &lt;em&gt;"what's your refund policy?"&lt;/em&gt; question → roughly &lt;strong&gt;$0.0005&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A 50-page pasted document with &lt;em&gt;"summarize this"&lt;/em&gt; prompt → roughly &lt;strong&gt;$0.30&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a 600× cost spread on two requests that both count as "1" to a token bucket. If your rate limit is 60 requests/minute, an attacker — or a buggy client, or a curious power user — can drive your bill into triple digits per hour while staying perfectly within rate-limit bounds.&lt;/p&gt;

&lt;p&gt;You need to cap the dollar value, not the request count.&lt;/p&gt;
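
&lt;p&gt;To make that spread concrete, here's the back-of-the-envelope math behind those two numbers, a sketch using the same illustrative gpt-4o-style prices the rest of this post assumes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative gpt-4o-style prices, USD per 1M tokens.
const PROMPT_PER_M     = 2.50;
const COMPLETION_PER_M = 10.00;

const cost = (promptTokens, completionTokens) =&amp;gt;
  (promptTokens     / 1_000_000) * PROMPT_PER_M +
  (completionTokens / 1_000_000) * COMPLETION_PER_M;

console.log(cost(50, 40).toFixed(4));       // 0.0005  (short question, ~40-token answer)
console.log(cost(120_000, 500).toFixed(4)); // 0.3050  (~120k-token pasted doc + short summary)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;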

&lt;h2&gt;
  
  
  Step 1 — Measure before you cap
&lt;/h2&gt;

&lt;p&gt;You cannot cap what you do not measure. The first thing to ship is a single funnel that every LLM call passes through, with structured logging into a real database (not a JSON file, not an analytics tool — something you can &lt;code&gt;JOIN&lt;/code&gt; and &lt;code&gt;WHERE&lt;/code&gt; against in real time).&lt;/p&gt;

&lt;p&gt;Here's the schema I use, simplified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;llm_usage_logs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;                     &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;session_id&lt;/span&gt;             &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;                  &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;prompt_tokens&lt;/span&gt;          &lt;span class="nb"&gt;INT&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;cached_prompt_tokens&lt;/span&gt;   &lt;span class="nb"&gt;INT&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;-- subset of prompt_tokens that hit OpenAI's cache&lt;/span&gt;
  &lt;span class="n"&gt;completion_tokens&lt;/span&gt;      &lt;span class="nb"&gt;INT&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;total_tokens&lt;/span&gt;           &lt;span class="nb"&gt;INT&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;prompt_cost_usd&lt;/span&gt;        &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;cached_prompt_cost_usd&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- billed at ~50% of full prompt rate&lt;/span&gt;
  &lt;span class="n"&gt;completion_cost_usd&lt;/span&gt;    &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;total_cost_usd&lt;/span&gt;         &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;finish_reason&lt;/span&gt;          &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;response_time_ms&lt;/span&gt;       &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;             &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;llm_usage_logs_session_time_idx&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;llm_usage_logs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three non-obvious choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Store cost in USD as &lt;code&gt;NUMERIC&lt;/code&gt;, not as cents in an integer.&lt;/strong&gt; Token-priced cost has 4–6 significant decimal digits. If you store cents, you'll round most short calls to zero and the arithmetic gets useless — see the quick demo after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index on &lt;code&gt;(session_id, created_at DESC)&lt;/code&gt;.&lt;/strong&gt; Every "is this user over budget?" query scans recent rows for a session. Without this index it's a sequential scan, and you'll regret it the day usage spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provision the cached-token columns from day one.&lt;/strong&gt; Even if you're not using prompt caching yet, adding &lt;code&gt;cached_prompt_tokens&lt;/code&gt; and &lt;code&gt;cached_prompt_cost_usd&lt;/code&gt; up front saves you a migration later — they default to &lt;code&gt;0&lt;/code&gt; and Step 2 wires them up.&lt;/li&gt;
&lt;/ol&gt;
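
&lt;p&gt;Here's the cents problem in two lines; the cost of a typical short call simply vanishes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const shortCallUsd = 0.000525;               // a typical 50-token call

console.log(Math.round(shortCallUsd * 100)); // 0: integer cents erase it
console.log(shortCallUsd.toFixed(6));        // "0.000525": NUMERIC(10,6) keeps it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;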

&lt;p&gt;Then a single logging function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;logLLMCall&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// All prices below are USD per 1M tokens (NOT per 1K).&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;promptPricePerM&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LLM_PROMPT_PRICE_PER_M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mf"&gt;2.50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;completionPricePerM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LLM_COMPLETION_PRICE_PER_M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mf"&gt;10.00&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;promptCost&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt_tokens&lt;/span&gt;     &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;promptPricePerM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;completionCost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;completion_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;completionPricePerM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;totalCost&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;promptCost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;completionCost&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`INSERT INTO llm_usage_logs
       (session_id, model, prompt_tokens, completion_tokens, total_tokens,
        prompt_cost_usd, completion_cost_usd, total_cost_usd,
        finish_reason, response_time_ms)
     VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10)`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="nx"&gt;promptCost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;completionCost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;totalCost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response_time_ms&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2 — Get the cost math right
&lt;/h2&gt;

&lt;p&gt;Three things that bite people in cost calculation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Different pricing per model.&lt;/strong&gt; GPT-4o costs roughly 17× what GPT-4o-mini costs per million tokens ($2.50 vs $0.15 on input). Don't hardcode prices; pull them from env vars (or a small &lt;code&gt;model_pricing&lt;/code&gt; table) keyed by model name. When OpenAI announces new pricing — and they will — you change config, not code.&lt;/p&gt;
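
&lt;p&gt;A minimal sketch of that lookup (the map shape, prices, and &lt;code&gt;getPricing&lt;/code&gt; helper are illustrative, not exhaustive):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// USD per 1M tokens. Load from env/config or a model_pricing table in practice.
const MODEL_PRICING = {
  'gpt-4o':      { prompt: 2.50, cached_prompt: 1.25,  completion: 10.00 },
  'gpt-4o-mini': { prompt: 0.15, cached_prompt: 0.075, completion: 0.60 },
};

function getPricing(model) {
  const pricing = MODEL_PRICING[model];
  // Fail loudly: a silently-defaulted price corrupts every cost row after it.
  if (!pricing) throw new Error(`No pricing configured for model: ${model}`);
  return pricing;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;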

&lt;p&gt;&lt;strong&gt;Trust the API's token count, not your local one.&lt;/strong&gt; Tokenizers like &lt;code&gt;tiktoken&lt;/code&gt; are &lt;em&gt;close&lt;/em&gt; to what the API actually charges, but not identical. The number that matters is &lt;code&gt;response.usage.{prompt,completion}_tokens&lt;/code&gt; returned in the API response. Log that, not your local pre-call estimate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming responses still report usage — if you ask for it.&lt;/strong&gt; With &lt;code&gt;stream: true&lt;/code&gt;, you must pass &lt;code&gt;stream_options: { include_usage: true }&lt;/code&gt; to get a final usage chunk. Many people miss this and end up logging &lt;code&gt;0&lt;/code&gt; tokens for every streamed call, which silently zeroes their cost dashboard.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;stream_options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;include_usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;// arrives in the final chunk&lt;/span&gt;
  &lt;span class="c1"&gt;// ... yield chunk.choices[0].delta to client&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;logLLMCall&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;timing&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Don't forget prompt caching — it changes the math
&lt;/h3&gt;

&lt;p&gt;If you've enabled OpenAI prompt caching (it kicks in automatically once your prompt prefix is long and reused), part of your prompt tokens come back at roughly &lt;strong&gt;half price&lt;/strong&gt;. They show up under &lt;code&gt;usage.prompt_tokens_details.cached_tokens&lt;/code&gt;. If you ignore the field, your dashboard will overstate spend by 20–30% — and worse, you'll under-credit the optimizations you're actually doing.&lt;/p&gt;

&lt;p&gt;Three-rate calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calculateCost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prompt_tokens_details&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;cached_tokens&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;promptRaw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prompt_tokens&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// billed at full rate&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completion_tokens&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// All prices below are USD per 1M tokens.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fullPromptPerM&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LLM_PROMPT_PRICE_PER_M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mf"&gt;2.50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cachedPromptPerM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LLM_CACHED_PROMPT_PRICE_PER_M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mf"&gt;1.25&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// ~50% off full&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;completionPerM&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LLM_COMPLETION_PRICE_PER_M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mf"&gt;10.00&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;prompt_cost_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;promptRaw&lt;/span&gt;  &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;fullPromptPerM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;cached_prompt_cost_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt;     &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;cachedPromptPerM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;completion_cost_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;completionPerM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;total_cost_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;promptRaw&lt;/span&gt;  &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;fullPromptPerM&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt;     &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;cachedPromptPerM&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;completionPerM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Concrete example: a 4,000-token system prompt with a 2,000-token cache hit, plus 200 completion tokens on &lt;code&gt;gpt-4o&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Rate (per 1M)&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt (uncached)&lt;/td&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$0.005000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt (cached)&lt;/td&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$0.002500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Completion&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;$0.002000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0095&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Without caching that same call would be $0.0120; caching shaves off about &lt;strong&gt;21%&lt;/strong&gt;. The &lt;code&gt;cached_prompt_cost_usd&lt;/code&gt; column you provisioned in Step 1 lets you track caching ROI directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on currency
&lt;/h3&gt;

&lt;p&gt;If you operate in a non-USD currency (we report in THB internally), &lt;strong&gt;store the canonical cost as USD and convert at query time.&lt;/strong&gt; Exchange rates drift; locking yesterday's rate into the row makes month-over-month comparisons quietly lie to you.&lt;/p&gt;
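
&lt;p&gt;A sketch of query-time conversion, where the rate comes in as a parameter from whatever FX source you trust:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Canonical storage stays USD; convert only when reporting.
async function dailySpendInLocalCurrency(usdToLocalRate) {
  const { rows } = await pg.query(
    `SELECT DATE(created_at) AS date, SUM(total_cost_usd)::float AS cost_usd
       FROM llm_usage_logs
      GROUP BY DATE(created_at)
      ORDER BY date DESC`
  );
  return rows.map((r) =&amp;gt; ({ ...r, cost_local: r.cost_usd * usdToLocalRate }));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;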

&lt;h2&gt;
  
  
  Step 3 — Per-session budget middleware
&lt;/h2&gt;

&lt;p&gt;Now that costs are visible, add an Express middleware that runs &lt;em&gt;before&lt;/em&gt; the LLM call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;budgetGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sessionId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tier&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;tier&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anonymous&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dailyCapUsd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;anonymous&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;free&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mf"&gt;2.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;paid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;20.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;internal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;100.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="nx"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT COALESCE(SUM(total_cost_usd), 0)::float AS spent
       FROM llm_usage_logs
      WHERE session_id = $1
        AND created_at &amp;gt; NOW() - INTERVAL '24 hours'`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;spent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;spent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;spent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dailyCapUsd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dailyCapUsd&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;spent&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;spent&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;dailyCapUsd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;daily_budget_exceeded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;You have hit your daily usage limit. Try again in 24 hours.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;spent_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;spent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;cap_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dailyCapUsd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;budgetGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chatHandler&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Security note on the identity you cap against.&lt;/strong&gt; The example above falls back to &lt;code&gt;req.body.session_id&lt;/code&gt; for clarity, but in production &lt;strong&gt;never trust an identifier the client can rotate&lt;/strong&gt;. A hostile (or just curious) client can change &lt;code&gt;session_id&lt;/code&gt; on every request and dodge the cap entirely. Pull the identity from a verified source: a signed cookie session, a JWT subject claim, or the authenticated user object set by your auth middleware. Treat the body fallback as prototype-only — replace it with the real authenticated principal before anything ships.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A few real-world refinements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier the cap.&lt;/strong&gt; Anonymous traffic gets the smallest budget, paid users get more. Don't give every visitor a $20/day allowance by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;24h sliding window, not calendar day.&lt;/strong&gt; A user who maxes out at 23:59 shouldn't get a fresh budget at 00:00.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attach the budget object to &lt;code&gt;req&lt;/code&gt;.&lt;/strong&gt; Downstream handlers can read &lt;code&gt;req.budget.remaining&lt;/code&gt; to make smarter decisions — see the next section.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 4 — Soft cap → hard cap → fallback (the 3-tier pattern)
&lt;/h2&gt;

&lt;p&gt;A single threshold is too binary. Instead, treat the cap as three zones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Zone&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Green&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 80% of cap&lt;/td&gt;
&lt;td&gt;Use the premium model. Business as usual.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Yellow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80–100%&lt;/td&gt;
&lt;td&gt;Log a warning, ping &lt;code&gt;#alerts&lt;/code&gt; on Slack, degrade to a cheaper model (e.g. &lt;code&gt;gpt-4o-mini&lt;/code&gt;). User keeps getting answers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Red&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;gt; 100%&lt;/td&gt;
&lt;td&gt;Hard-refuse with a friendly message.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The decision flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
    A[Incoming chat request] --&amp;gt; B[Look up 24h spend for user]
    B --&amp;gt; C{spent &amp;lt; 80% of cap?}
    C -- yes --&amp;gt; D["🟢 GREEN&amp;lt;br/&amp;gt;use premium model"]
    C -- no --&amp;gt; E{spent &amp;lt; 100% of cap?}
    E -- yes --&amp;gt; F["🟡 YELLOW&amp;lt;br/&amp;gt;fall back to cheap model&amp;lt;br/&amp;gt;+ Slack alert"]
    E -- no --&amp;gt; G["🔴 RED&amp;lt;br/&amp;gt;return 429&amp;lt;br/&amp;gt;budget_exceeded"]
    D --&amp;gt; H[Call LLM, log usage, respond]
    F --&amp;gt; H
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;pickModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spent&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;            &lt;span class="c1"&gt;// upstream middleware already 429'd&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// graceful degradation&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Yellow zone is the trick that makes this &lt;strong&gt;user-friendly instead of abusive&lt;/strong&gt;. The product still works at 90% of cap; it's just running on a cheaper engine. Most users won't notice. Engineers who wake up to the Slack alert can investigate before the wall is hit.&lt;/p&gt;
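
&lt;p&gt;The Slack ping can be a plain POST to an incoming webhook. A sketch, assuming a &lt;code&gt;SLACK_WEBHOOK_URL&lt;/code&gt; env var, with in-memory dedupe so one noisy session doesn't flood the channel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const alertedSessions = new Set(); // clear on a daily timer in a real setup

async function alertYellowZone(sessionId, spent, cap) {
  if (alertedSessions.has(sessionId)) return; // one ping per session
  alertedSessions.add(sessionId);
  try {
    await fetch(process.env.SLACK_WEBHOOK_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        text: `⚠️ ${sessionId} at ${((spent / cap) * 100).toFixed(0)}% of its daily LLM budget`,
      }),
    });
  } catch (err) {
    console.error('slack alert failed', err); // never fail the request over alerting
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;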

&lt;h2&gt;
  
  
  Step 5 — Caching is the largest cost cut you can make
&lt;/h2&gt;

&lt;p&gt;Logging and capping reduce &lt;em&gt;runaway&lt;/em&gt; spend. Caching reduces &lt;em&gt;baseline&lt;/em&gt; spend, often by 20–50%. Two layers, in order of effort:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact-match cache (cheap).&lt;/strong&gt; Normalize the question (lowercase, trim, collapse whitespace) and store it alongside the answer — two extra &lt;code&gt;TEXT&lt;/code&gt; columns on &lt;code&gt;llm_usage_logs&lt;/code&gt;. If you've answered the exact same thing in the last 24h, return the stored answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;findExactMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT answer FROM llm_usage_logs
      WHERE question = $1 AND answer IS NOT NULL
      ORDER BY created_at DESC
      LIMIT 1`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Semantic cache (high ROI).&lt;/strong&gt; Embed the incoming question, look for past questions with cosine similarity ≥ 0.95, return their stored answers. This catches &lt;em&gt;"what is your refund policy?"&lt;/em&gt; vs &lt;em&gt;"how do refunds work?"&lt;/em&gt; — textually different, semantically identical. If your hit rate is 30%, you've cut 30% off your bill the day you ship it.&lt;/p&gt;

&lt;p&gt;A practical tip: &lt;strong&gt;don't cache personalized or stateful answers.&lt;/strong&gt; Cache the FAQ-style stuff. A simple &lt;code&gt;cacheable: true&lt;/code&gt; flag on the prompt template handles this cleanly without leaking one user's data to another.&lt;/p&gt;
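
&lt;p&gt;A sketch of the semantic lookup, assuming a separate &lt;code&gt;llm_answer_cache&lt;/code&gt; table (hypothetical name) with a pgvector embedding column; pgvector's &lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; operator is cosine &lt;em&gt;distance&lt;/em&gt;, so similarity ≥ 0.95 means distance ≤ 0.05:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Assumes: llm_answer_cache(question TEXT, answer TEXT,
//          embedding vector(1536), cacheable BOOL, created_at TIMESTAMPTZ)
async function findSemanticMatch(question) {
  const emb = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });
  const vector = JSON.stringify(emb.data[0].embedding); // pgvector accepts '[...]'

  const { rows } = await pg.query(
    `SELECT answer, 1 - (embedding &amp;lt;=&amp;gt; $1::vector) AS similarity
       FROM llm_answer_cache
      WHERE cacheable
        AND created_at &amp;gt; NOW() - INTERVAL '24 hours'
      ORDER BY embedding &amp;lt;=&amp;gt; $1::vector
      LIMIT 1`,
    [vector]
  );
  return rows[0]?.similarity &amp;gt;= 0.95 ? rows[0].answer : null;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;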

&lt;h2&gt;
  
  
  Step 6 — Observability you'll actually look at
&lt;/h2&gt;

&lt;p&gt;Build one endpoint that returns this morning's numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/llm/budget&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`
    SELECT
      DATE(created_at)            AS date,
      model,
      COUNT(*)                    AS calls,
      SUM(total_tokens)           AS tokens,
      SUM(total_cost_usd)::float  AS cost_usd
    FROM llm_usage_logs
    WHERE created_at &amp;gt; NOW() - INTERVAL '30 days'
    GROUP BY DATE(created_at), model
    ORDER BY date DESC
  `&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pipe it into a small HTML dashboard or a Grafana board. Set one alert worth its salt: &lt;strong&gt;&amp;gt;30% above the 7-day rolling average triggers a Slack ping.&lt;/strong&gt; That single alert has caught every cost incident I've had since I shipped it.&lt;/p&gt;
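
&lt;p&gt;The alert itself is one query on a timer. A sketch, where &lt;code&gt;alertSlack&lt;/code&gt; stands in for whatever notifier you already have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Run every ~15 minutes from cron or setInterval.
async function checkSpendAnomaly() {
  const { rows } = await pg.query(`
    SELECT
      COALESCE(SUM(total_cost_usd) FILTER (
        WHERE created_at &amp;gt; NOW() - INTERVAL '24 hours'), 0)::float AS today,
      COALESCE(SUM(total_cost_usd) FILTER (
        WHERE created_at &amp;gt; NOW() - INTERVAL '8 days'
          AND created_at &amp;lt;= NOW() - INTERVAL '24 hours'), 0)::float / 7 AS avg7d
    FROM llm_usage_logs
  `);
  const { today, avg7d } = rows[0];
  if (avg7d &amp;gt; 0 &amp;amp;&amp;amp; today &amp;gt; avg7d * 1.3) {
    await alertSlack(`LLM spend $${today.toFixed(2)} is ${((today / avg7d - 1) * 100).toFixed(0)}% above the 7-day average`);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;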

&lt;p&gt;A simple version of the dashboard looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│  LLM SPEND — last 7 days                              ↻ refresh  │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   $12 ┤                                            ▇             │
│   $10 ┤             ▇                              ▇             │
│    $8 ┤      ▇      ▇      ▇             ▇        ▇             │
│    $6 ┤      ▇      ▇      ▇      ▇      ▇        ▇             │
│    $4 ┤▇     ▇      ▇      ▇      ▇      ▇        ▇             │
│    $2 ┤▇     ▇      ▇      ▇      ▇      ▇        ▇             │
│    $0 ┴┴─────┴──────┴──────┴──────┴──────┴────────┴───           │
│       Mon   Tue    Wed    Thu    Fri    Sat      Sun ⚠ +47%      │
│                                                                  │
│  Today          $11.83   ████████████████  (cap $20)             │
│  Top session    sess_8f2 $4.12  • 312 calls  • gpt-4o            │
│  Cache hit      31.4%    (≈ $5.20 saved today)                   │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three numbers I look at every morning: &lt;strong&gt;today's total&lt;/strong&gt;, &lt;strong&gt;the highest-spend session&lt;/strong&gt;, and &lt;strong&gt;cache hit rate&lt;/strong&gt;. Anything weird shows up in one of those three.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfalls I learned the hard way
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tool calls aren't always counted the way you expect.&lt;/strong&gt; Older SDK versions occasionally return &lt;code&gt;0&lt;/code&gt; for &lt;code&gt;completion_tokens&lt;/code&gt; when the response is a tool call instead of plain text. Always log the full &lt;code&gt;usage&lt;/code&gt; object and verify against your dashboard once a week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retries double-count.&lt;/strong&gt; If your HTTP client retries a 5xx, you can log the same call twice. Pass an idempotency key — the request ID works — and dedupe on it.&lt;/p&gt;
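
&lt;p&gt;A sketch of the dedupe, assuming a &lt;code&gt;request_id&lt;/code&gt; column with a unique index (column and variable names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// One-time migration:
//   ALTER TABLE llm_usage_logs ADD COLUMN request_id TEXT;
//   CREATE UNIQUE INDEX llm_usage_logs_request_id_idx
//     ON llm_usage_logs (request_id);

// A retried call reuses the same request_id and lands on DO NOTHING.
await pg.query(
  `INSERT INTO llm_usage_logs (request_id, session_id, model, total_cost_usd)
   VALUES ($1, $2, $3, $4)
   ON CONFLICT (request_id) DO NOTHING`,
  [requestId, sessionId, model, totalCost]
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;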

&lt;p&gt;&lt;strong&gt;Logging on the hot path.&lt;/strong&gt; Don't &lt;code&gt;await&lt;/code&gt; the log insert before responding to the user; the user is waiting on you. Fire-and-forget with a try/catch, or push to a small in-memory queue and drain in the background. But &lt;strong&gt;bound the queue size&lt;/strong&gt;: an unbounded queue is just a memory leak waiting for traffic.&lt;/p&gt;
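
&lt;p&gt;A minimal bounded-queue sketch: when the database falls behind, it drops log rows instead of eating memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const MAX_QUEUE = 10_000;
const logQueue = [];

function enqueueLog(row) {
  if (logQueue.length &amp;gt;= MAX_QUEUE) {
    console.warn('llm log queue full, dropping entry'); // losing a log beats OOM
    return;
  }
  logQueue.push(row);
}

// Drain in the background, off the request path.
setInterval(async () =&amp;gt; {
  const batch = logQueue.splice(0, 100);
  for (const row of batch) {
    try {
      await logLLMCall(row);
    } catch (err) {
      console.error('log insert failed', err);
    }
  }
}, 1_000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;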

&lt;p&gt;&lt;strong&gt;Long system prompts compound.&lt;/strong&gt; A 4,000-token system prompt × 10 messages per session × 1,000 sessions/day is 40M prompt tokens, even if every user message is short. Use OpenAI's prompt caching where supported, and trim the system prompt aggressively. Most "best practice" system prompts on the internet are 3× longer than they need to be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cached input is cheaper — and the field is easy to miss.&lt;/strong&gt; Prompt caching lives in &lt;code&gt;usage.prompt_tokens_details.cached_tokens&lt;/code&gt;, not at the top level. We covered the math in Step 2; the practical pitfall is that older code paths often &lt;em&gt;only&lt;/em&gt; read &lt;code&gt;prompt_tokens&lt;/code&gt; and silently double-bill anything cached. Audit every place that reads usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this gets you
&lt;/h2&gt;

&lt;p&gt;After shipping the full pattern in our app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No more cost surprises.&lt;/strong&gt; The worst possible day is now bounded by &lt;code&gt;cap_per_user × active_users&lt;/code&gt;. We can math it out before launching a feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abuse is a non-event.&lt;/strong&gt; The same loop that burned through a daily budget in three hours now hits the 429 in twelve minutes and stays quiet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spend is steerable.&lt;/strong&gt; When we want to spend less, we tighten the Yellow threshold. When we want to spend more on quality, we raise it. It's a knob, not a panic button.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineers stopped being scared of LLM features.&lt;/strong&gt; "What if it gets expensive?" finally has an answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where to go next
&lt;/h2&gt;

&lt;p&gt;The version above is the minimum viable pattern. Once it's running, the natural extensions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-feature budgets&lt;/strong&gt; — chat, summarization, and embedding refresh each get their own cap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly detection&lt;/strong&gt; — alert on cost-per-session 3σ above the mean instead of a fixed threshold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-throttle&lt;/strong&gt; — when the daily org-wide cap is approaching, slow down lower-tier users first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-flight estimation&lt;/strong&gt; — refuse a 200KB pasted blob &lt;em&gt;before&lt;/em&gt; it hits the API, not after.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each is a follow-up post. The foundation — log, cap, degrade, cache — is what stops the bleeding tonight.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you ship this pattern and it saves you money, I'd love to hear what your hit rate on the semantic cache ended up being. That number was the most surprising thing for me — much higher than I expected.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>openai</category>
      <category>node</category>
    </item>
    <item>
      <title>9 Verified Tools to Stop Burning Claude Tokens Unnecessarily</title>
      <dc:creator>Phasu  Yeneng</dc:creator>
      <pubDate>Mon, 20 Apr 2026 14:03:30 +0000</pubDate>
      <link>https://dev.to/kmusicman/9-verified-tools-to-stop-burning-claude-tokens-unnecessarily-f9e</link>
      <guid>https://dev.to/kmusicman/9-verified-tools-to-stop-burning-claude-tokens-unnecessarily-f9e</guid>
      <description>&lt;p&gt;You're not using Claude more — you're just wasting more context.&lt;/p&gt;

&lt;p&gt;I went looking for real, working tools after seeing a widely shared list that mixed legitimate repos with hallucinated ones. This article only covers tools I could verify on GitHub, organized by the type of waste they fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why tokens disappear faster than you expect
&lt;/h2&gt;

&lt;p&gt;Before the tools: understanding &lt;em&gt;where&lt;/em&gt; tokens actually go.&lt;/p&gt;

&lt;p&gt;Most developers assume their prompts are the main cost. They're not. The real culprits are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Verbose model output&lt;/strong&gt; — Claude explaining what it's about to do, then doing it, then summarizing what it did&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw CLI output&lt;/strong&gt; — dumping full &lt;code&gt;git log&lt;/code&gt;, &lt;code&gt;npm install&lt;/code&gt;, or test runner output directly into context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bloated CLAUDE.md&lt;/strong&gt; — this file loads on &lt;em&gt;every&lt;/em&gt; turn before Claude reads a single line of your code. A 5,000-token CLAUDE.md costs 5,000 tokens per message, before you've typed a word&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code navigation by content&lt;/strong&gt; — when Claude reads entire files to find one function instead of navigating by symbol&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ghost tokens&lt;/strong&gt; — leftover context from earlier in the session that no longer contributes to the task but still costs money every turn&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each category has a different fix. Here's what actually works.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 1 — Shrink what Claude writes back
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/juliusbrussee/caveman" rel="noopener noreferrer"&gt;Caveman&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The simplest idea with surprisingly good results: make Claude talk like a caveman. Short words. No filler. Still technically accurate.&lt;/p&gt;

&lt;p&gt;It ships as a Claude Code skill that cuts ~65–75% of output tokens while preserving full technical accuracy. The compression is aggressive but the signal stays intact — you get &lt;code&gt;fix auth bug in login.js line 42&lt;/code&gt; instead of three paragraphs explaining what the fix does.&lt;/p&gt;

&lt;p&gt;Works as a plugin for Cursor, Windsurf, Cline, and others too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# install as a Claude Code skill&lt;/span&gt;
&lt;span class="c"&gt;# see: https://github.com/juliusbrussee/caveman&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Long coding sessions where Claude's explanations are eating your context budget.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://github.com/drona23/claude-token-efficient" rel="noopener noreferrer"&gt;claude-token-efficient&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Single &lt;code&gt;CLAUDE.md&lt;/code&gt; file. Drop it into your project. Done.&lt;/p&gt;

&lt;p&gt;It bakes response-terseness rules directly into Claude's instructions, forcing shorter output on heavy workflows without you having to prompt for it every time. No code changes, no new dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Projects where you want a permanent "be concise" baseline without running an extra tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 2 — Compress what you send in
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/rtk-ai/rtk" rel="noopener noreferrer"&gt;RTK (Rust Token Killer)&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;A CLI proxy that filters terminal output before it reaches Claude. Instead of Claude seeing 2,000 tokens of raw &lt;code&gt;git status&lt;/code&gt;, it sees ~200 tokens of the relevant parts.&lt;/p&gt;

&lt;p&gt;The hook transparently rewrites shell commands — &lt;code&gt;git status&lt;/code&gt; becomes &lt;code&gt;rtk git status&lt;/code&gt; — and Claude never sees the rewrite, just the compressed result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# without rtk&lt;/span&gt;
git status  → ~2,000 tokens raw output

&lt;span class="c"&gt;# with rtk&lt;/span&gt;
rtk git status  → ~200 tokens filtered output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claims 60–90% reduction on common dev commands. Single Rust binary, zero dependencies.&lt;/p&gt;
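&lt;p&gt;RTK itself is a Rust binary, but the filtering concept is easy to sketch. Here's an illustrative Python version (my sketch of the idea, not RTK's code) that swaps verbose &lt;code&gt;git status&lt;/code&gt; prose for a porcelain summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# concept sketch only, not RTK's implementation
import subprocess

def compact_git_status():
    """Summarize `git status` instead of dumping its full prose output."""
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    lines = out.splitlines()
    header = f"{len(lines)} changed paths"
    return header + ("\n" + "\n".join(lines[:20]) if lines else "")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;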

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams running agentic workflows where Claude executes a lot of shell commands.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://github.com/chopratejas/headroom" rel="noopener noreferrer"&gt;Headroom&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;A context optimization layer that sits between your app and the LLM. Three compression modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SmartCrusher&lt;/strong&gt; — JSON compression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CodeCompressor&lt;/strong&gt; — AST-aware code compression (understands structure, not just text)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kompress-base&lt;/strong&gt; — general text compression&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AST-aware approach is the interesting one. It doesn't just truncate code — it understands which parts of a file are structurally relevant and compresses accordingly.&lt;/p&gt;
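&lt;p&gt;To make "AST-aware" concrete, here is a minimal Python sketch of the idea (illustrative only, not Headroom's actual algorithm): keep every function's signature and docstring, replace the body with &lt;code&gt;...&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# illustrative sketch of AST-aware compression, not Headroom's code
# requires Python 3.9+ (ast.unparse)
import ast

def compress_source(source):
    """Keep signatures and docstrings; replace function bodies with `...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            new_body = []
            if doc:
                new_body.append(ast.Expr(ast.Constant(doc)))
            new_body.append(ast.Expr(ast.Constant(...)))  # stub body
            node.body = new_body
    return ast.unparse(tree)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;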

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Applications (not just Claude Code) that programmatically build context before sending to any LLM.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://github.com/nadimtuhin/claude-token-optimizer" rel="noopener noreferrer"&gt;claude-token-optimizer&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Reusable CLAUDE.md setup prompts that structure your documentation so Claude only loads what it needs per task.&lt;/p&gt;

&lt;p&gt;One real-world example from the repo: a RedwoodJS project reduced session start from 11,000 tokens down to 1,300 by restructuring which docs load when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Projects with large documentation that Claude currently loads all at once.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 3 — MCP-level optimization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/ooples/token-optimizer-mcp" rel="noopener noreferrer"&gt;Token Optimizer MCP&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Intelligent token optimization for Claude Code via MCP. Claims 95%+ reduction through caching, compression, and tool-usage intelligence — it tracks which tools Claude actually uses and trims the tool definitions it sends accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Claude Code users running MCP-heavy workflows.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://glama.ai/mcp/servers/woling-dev/promptthrift-mcp" rel="noopener noreferrer"&gt;PromptThrift MCP&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Compresses conversation history using a local Gemma 4 model (runs on your machine, no extra API cost) or heuristic fallback. Key feature: &lt;strong&gt;pinned facts&lt;/strong&gt; — you can mark specific context as protected so it survives compression.&lt;/p&gt;

&lt;p&gt;Claims 70–90% reduction on conversation history while keeping critical context intact.&lt;/p&gt;
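&lt;p&gt;The pinned-facts idea is worth stealing even if you never install the server. A rough heuristic sketch (my illustration, not PromptThrift's code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# heuristic history compression with pinned facts, illustration only
def compress_history(messages, pinned_facts, keep_last=4):
    """Keep pinned facts verbatim and the last few turns; collapse the rest."""
    old, recent = messages[:-keep_last], messages[-keep_last:]
    compressed = [
        {"role": "system",
         "content": "Pinned facts (protected):\n" + "\n".join(pinned_facts)},
        {"role": "system",
         "content": f"[{len(old)} earlier messages summarized away]"},
    ]
    return compressed + recent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;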

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Long multi-turn sessions where early context becomes expensive dead weight.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://github.com/Mibayy/token-savior" rel="noopener noreferrer"&gt;Token Savior&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;MCP server specifically built for code navigation. Instead of Claude reading entire files to find a function, it indexes your codebase by symbol — functions, classes, imports, call graph — and navigates by pointer.&lt;/p&gt;

&lt;p&gt;Also includes a persistent memory engine that stores decisions, conventions, and session summaries in SQLite and re-injects them as a compact delta at session start.&lt;/p&gt;

&lt;p&gt;Claims 97% reduction on code navigation across 170+ real sessions.&lt;/p&gt;
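&lt;p&gt;The symbol-index idea looks roughly like this in Python (my sketch of the concept, not Token Savior's code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# build a symbol-to-file:line index so an agent can jump by pointer
import ast
from pathlib import Path

def index_symbols(root="."):
    """Map every function/class name to its file and line number."""
    index = {}
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                index[node.name] = f"{path}:{node.lineno}"
    return index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;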

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Large codebases where Claude spends significant tokens just finding the right file and function.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 4 — Clean up ghost tokens
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/alexgreensh/token-optimizer" rel="noopener noreferrer"&gt;Token Optimizer&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Targets "ghost tokens" — context that's still technically in the window but no longer relevant to the current task. Also helps survive context compaction without losing quality.&lt;/p&gt;

&lt;p&gt;More of a diagnostic and cleanup tool than a compression layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Sessions that run long and accumulate stale context from earlier tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The free fix most people skip
&lt;/h2&gt;

&lt;p&gt;Before installing anything: &lt;strong&gt;audit your CLAUDE.md&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;According to Claude Code's official documentation, CLAUDE.md loads before every response, before Claude reads your code, before anything else. It's the most expensive file in your project on a per-token basis.&lt;/p&gt;

&lt;p&gt;The recommended limit is &lt;strong&gt;under 200 lines&lt;/strong&gt;. If yours is longer, move sections into on-demand skill files that only load when invoked. Most CLAUDE.md files I've seen in the wild are 3–5x longer than they need to be.&lt;/p&gt;
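&lt;p&gt;A quick way to check where you stand, using a rough 4-characters-per-token heuristic (Claude's real tokenizer will count differently):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# rough per-message cost estimate for CLAUDE.md
# (~4 chars/token is a heuristic; Claude's actual tokenizer differs)
from pathlib import Path

def audit(path="CLAUDE.md", line_limit=200):
    text = Path(path).read_text(encoding="utf-8")
    lines = text.count("\n") + 1
    est_tokens = len(text) // 4
    print(f"{path}: {lines} lines, ~{est_tokens} tokens loaded every turn")
    if lines &amp;gt; line_limit:
        print(f"over the ~{line_limit}-line guideline: move sections "
              "into on-demand skill files")

if __name__ == "__main__":
    audit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;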




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it fixes&lt;/th&gt;
&lt;th&gt;Claimed reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/juliusbrussee/caveman" rel="noopener noreferrer"&gt;Caveman&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Verbose model output&lt;/td&gt;
&lt;td&gt;~65–75% output tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/drona23/claude-token-efficient" rel="noopener noreferrer"&gt;claude-token-efficient&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Verbose model output&lt;/td&gt;
&lt;td&gt;Drop-in terseness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/rtk-ai/rtk" rel="noopener noreferrer"&gt;RTK&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Raw CLI output in context&lt;/td&gt;
&lt;td&gt;60–90% on shell commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/chopratejas/headroom" rel="noopener noreferrer"&gt;Headroom&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Input context size&lt;/td&gt;
&lt;td&gt;AST-aware compression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/nadimtuhin/claude-token-optimizer" rel="noopener noreferrer"&gt;claude-token-optimizer&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Doc structure / loading&lt;/td&gt;
&lt;td&gt;11k → 1.3k session start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/ooples/token-optimizer-mcp" rel="noopener noreferrer"&gt;Token Optimizer MCP&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;MCP tool definitions&lt;/td&gt;
&lt;td&gt;95%+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers/woling-dev/promptthrift-mcp" rel="noopener noreferrer"&gt;PromptThrift MCP&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Conversation history&lt;/td&gt;
&lt;td&gt;70–90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Mibayy/token-savior" rel="noopener noreferrer"&gt;Token Savior&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Code navigation&lt;/td&gt;
&lt;td&gt;~97%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/alexgreensh/token-optimizer" rel="noopener noreferrer"&gt;Token Optimizer&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Ghost tokens / session health&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The numbers above come from each project's own documentation — treat them as upper bounds under ideal conditions, not guarantees.&lt;/p&gt;

&lt;p&gt;Pick one from Category 1 and one from Category 2. Those two changes alone will have the most impact on day-to-day cost for most workflows.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>productivity</category>
      <category>tools</category>
    </item>
    <item>
      <title>Why your RAG chatbot fails in Thai — and how to fix it</title>
      <dc:creator>Phasu  Yeneng</dc:creator>
      <pubDate>Sun, 19 Apr 2026 10:08:22 +0000</pubDate>
      <link>https://dev.to/kmusicman/why-your-rag-chatbot-fails-in-thai-and-how-to-fix-it-3m72</link>
      <guid>https://dev.to/kmusicman/why-your-rag-chatbot-fails-in-thai-and-how-to-fix-it-3m72</guid>
      <description>&lt;h2&gt;
  
  
  Why your RAG chatbot fails in Thai — and how to fix it
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A real-world walkthrough of how we built a customer service chatbot for a Thai e-commerce company — and the chunking problem nobody warns you about.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;When I started building a RAG (Retrieval-Augmented Generation) chatbot for a Thai e-commerce company, I made the same mistake every developer makes: I copied the LangChain quickstart example, set &lt;code&gt;chunk_size=500&lt;/code&gt;, and expected things to just work.&lt;/p&gt;

&lt;p&gt;They didn't.&lt;/p&gt;

&lt;p&gt;This is the story of why naive chunking fails for Thai text, what we built instead, and the full pipeline from PDF product manuals to chatbot answers — using Python, Qdrant, and OpenAI.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Warns You About
&lt;/h2&gt;

&lt;p&gt;Most RAG tutorials are written with English in mind. The chunking logic looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Works fine for English
&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# or
&lt;/span&gt;&lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works because English has clear word boundaries — spaces between every word. When you split on periods or character count, you still get coherent, searchable chunks.&lt;/p&gt;

&lt;p&gt;Thai is completely different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thai has no spaces between words.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This sentence — "ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ" — means "Our store has many product categories to choose from." But to a naive chunker, it looks like one enormous, unsplittable blob. There are 7 meaningful words in there, with zero whitespace between them.&lt;/p&gt;

&lt;p&gt;Here's what happens when you embed that raw blob versus properly tokenized words:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input to embedding model&lt;/th&gt;
&lt;th&gt;What it sees&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;One opaque token sequence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ร้านค้า|ของเรา|มี|สินค้า|หลาย|หมวดหมู่|ให้เลือกซื้อ&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seven distinct word-level units&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The second form produces embeddings that actually capture the meaning of each concept — "store", "product", "category" — which leads to better retrieval when a user asks "มีสินค้าหมวดหมู่ไหนบ้าง" (what product categories are available?).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline We Built
&lt;/h2&gt;

&lt;p&gt;Here's the full architecture:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDF product manuals / FAQ documents
    |
Python (PyMuPDF) → extract raw text
    |
Sentence splitting by '. '
    |
[Stored in MongoDB as raw sentences]
    |
Python → pythainlp tokenization
    |
OpenAI text-embedding-3-small
    |
Qdrant vector database (cosine similarity, 1536 dims)
    |
User query → tokenize → embed → search → top-7 chunks
    |
GPT-4o-mini + context → answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let's walk through each step with real code. Here are the dependencies we'll use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# requirements.txt
&lt;/span&gt;&lt;span class="py"&gt;pymupdf&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=1.27.2.2&lt;/span&gt;
&lt;span class="py"&gt;pythainlp&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=5.2.0&lt;/span&gt;
&lt;span class="py"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=2.32.0&lt;/span&gt;
&lt;span class="py"&gt;qdrant-client&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=1.17.1&lt;/span&gt;
&lt;span class="py"&gt;pymongo&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=4.10.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 1 — Extract Text from PDF
&lt;/h2&gt;

&lt;p&gt;We use &lt;code&gt;PyMuPDF&lt;/code&gt; (the &lt;code&gt;fitz&lt;/code&gt; library) instead of &lt;code&gt;PyPDF2&lt;/code&gt; because it handles Thai character encoding much more reliably.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/python/PdfToSentences.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pymupdf&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fitz&lt;/span&gt;  &lt;span class="c1"&gt;# PyMuPDF 1.27+ (legacy: import fitz)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_sentences_from_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pdf_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fitz&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pdf_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Split on English period + space — works for mixed Thai/English documents
&lt;/span&gt;    &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cleaned_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;•&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Remove bullet points
&lt;/span&gt;    &lt;span class="n"&gt;cleaned_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cleaned_text&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cleaned_text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to note here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;PyMuPDF&lt;/code&gt; over &lt;code&gt;PyPDF2&lt;/code&gt;?&lt;/strong&gt; Thai PDF documents often use non-standard font encodings. &lt;code&gt;PyMuPDF&lt;/code&gt; handles these much better — with &lt;code&gt;PyPDF2&lt;/code&gt; you'd frequently get garbled output or empty strings for Thai text blocks. Note: as of PyMuPDF 1.24+, the recommended import is &lt;code&gt;import pymupdf&lt;/code&gt; (the old &lt;code&gt;import fitz&lt;/code&gt; still works but is considered legacy).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why split on &lt;code&gt;'. '&lt;/code&gt; (period + space)?&lt;/strong&gt; Our documents are mixed Thai/English — product names, SKUs, and technical specs are often in English, while descriptions are Thai. The period-space split is a pragmatic middle ground that preserves Thai paragraphs as single chunks rather than fragmenting them randomly at character 500.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Limitation:&lt;/strong&gt; Formal Thai text often ends paragraphs with a line break rather than a period. If your PDFs have no periods at all, &lt;code&gt;text.split('. ')&lt;/code&gt; will return one giant chunk per page. In that case, use &lt;code&gt;pythainlp&lt;/code&gt;'s sentence tokenizer instead:&lt;/p&gt;


&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pythainlp.tokenize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sent_tokenize&lt;/span&gt;
&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sent_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crfcut&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 2 — Thai Word Tokenization Before Embedding
&lt;/h2&gt;

&lt;p&gt;This is the most important step, and the one that differs most from English RAG.&lt;/p&gt;

&lt;p&gt;Before sending any Thai text to the embedding model, we tokenize it with &lt;code&gt;pythainlp&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# thai_tokenizer.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pythainlp.tokenize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;word_tokenize&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;word_cut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;word_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;newmm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Join with pipe separator so the embedding model sees distinct units
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pythainlp&lt;/code&gt; uses a dictionary-based approach (&lt;code&gt;newmm&lt;/code&gt; engine) to segment Thai text into individual words:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Input:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"สินค้าอิเล็กทรอนิกส์ราคาถูกส่งฟรี"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Output:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"สินค้า|อิเล็กทรอนิกส์|ราคาถูก|ส่งฟรี"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the embedding model sees four distinct semantic units instead of one long string. The cosine similarity between "ส่งฟรี" (free shipping) and a user's query "จัดส่งฟรีไหม" (is shipping free?) will be much higher and more meaningful after proper tokenization.&lt;/p&gt;

&lt;p&gt;We also tried &lt;code&gt;attacut&lt;/code&gt; (a neural-network-based engine in &lt;code&gt;pythainlp&lt;/code&gt;) but settled on &lt;code&gt;newmm&lt;/code&gt; for its speed and dictionary coverage — important when your domain includes product jargon and Thai promotional phrases like "ลดราคา", "ส่งฟรี", "ผ่อนชำระ".&lt;/p&gt;
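&lt;p&gt;If you want to sanity-check the similarity claim on your own data, here's a minimal comparison using &lt;code&gt;word_cut&lt;/code&gt; from above and the &lt;code&gt;create_embedding&lt;/code&gt; helper defined in Step 3 below (four embedding calls; exact numbers will vary with your text):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# quick sanity check: raw vs tokenized similarity
import math
from openai_module import create_embedding  # defined in Step 3
from thai_tokenizer import word_cut

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc, query = "ส่งฟรี", "จัดส่งฟรีไหม"

raw = cosine(create_embedding(doc)["embed"],
             create_embedding(query)["embed"])
tok = cosine(create_embedding(word_cut(doc))["embed"],
             create_embedding(word_cut(query))["embed"])
print(f"raw: {raw:.3f}   tokenized: {tok:.3f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;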




&lt;h2&gt;
  
  
  Step 3 — Generate and Store Embeddings
&lt;/h2&gt;

&lt;p&gt;We use OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt; for embeddings — the current-generation model that replaced &lt;code&gt;text-embedding-ada-002&lt;/code&gt;. It scores 44% on the MIRACL multilingual benchmark vs 31.4% for the old model, and costs 5x less. The key is that we tokenize &lt;strong&gt;before&lt;/strong&gt; embedding — not after:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ingest_embeddings.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;thai_tokenizer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;word_cut&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai_module&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_embedding&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ✅ Tokenize Thai text FIRST
&lt;/span&gt;    &lt;span class="n"&gt;tokenized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;word_cut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Then embed the tokenized version
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;      &lt;span class="c1"&gt;# store original for display
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;    &lt;span class="c1"&gt;# store original keyword
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embeded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;    &lt;span class="c1"&gt;# embed the tokenized version
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;sentences_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice we store the &lt;strong&gt;original&lt;/strong&gt; text as the payload but create the embedding from the &lt;strong&gt;tokenized&lt;/strong&gt; version. This way, when a match is found, the chatbot returns the human-readable original sentence — not the pipe-separated tokenized form.&lt;/p&gt;

&lt;p&gt;The embedding function itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# openai_module.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;MAX_INPUT_LENGTH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MAX_INPUT_LENGTH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Text too long&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# replaces text-embedding-ada-002
&lt;/span&gt;        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;# if you change this, update Qdrant collection size too!
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4 — Qdrant as the Vector Store
&lt;/h2&gt;

&lt;p&gt;We use &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; running in Docker as our vector database. It's fast, lightweight, and the REST API is straightforward to call with Python's &lt;code&gt;requests&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# qdrant_module.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;QDRANT_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QDRANT_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_rag_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;QDRANT_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/collections/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vectors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatgpt_vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vector_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 1536 for text-embedding-3-small (default)
&lt;/span&gt;                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;distance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;QDRANT_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/collections/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/points/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;with_payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
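<p>One piece the walkthrough skips: actually pushing the embedded sentences into Qdrant. A minimal sketch in the same REST style, assuming &lt;code&gt;QDRANT_URL&lt;/code&gt; from the module above and the named vector &lt;code&gt;chatgpt_vector&lt;/code&gt; created in &lt;code&gt;create_rag_collection&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# qdrant_module.py (continued), sketch of the upsert step
import uuid
import requests

def upsert_point(collection_name, embedding, payload):
    """Insert one embedded sentence into Qdrant via the REST API."""
    requests.put(
        f"{QDRANT_URL}/collections/{collection_name}/points",
        json={
            "points": [{
                "id": str(uuid.uuid4()),
                "vector": {"chatgpt_vector": embedding},  # must match collection config
                "payload": payload,  # original sentence + keyword for display
            }]
        },
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;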



&lt;p&gt;Start Qdrant locally with one Docker command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-dt&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; VectorDB &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 6333:6333 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /your/path/storage:/qdrant/storage &lt;span class="se"&gt;\&lt;/span&gt;
  qdrant/qdrant:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use &lt;strong&gt;Cosine similarity&lt;/strong&gt; rather than Euclidean distance. For semantic search in Thai, cosine similarity performs better because it measures the angle between vectors (meaning similarity) rather than the absolute distance, which is sensitive to text length differences.&lt;/p&gt;
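&lt;p&gt;A toy example makes the difference obvious: scale a vector and the Euclidean distance explodes while the cosine stays put:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# cosine is invariant to vector magnitude; Euclidean distance is not
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 3 * a  # same direction ("same meaning"), larger magnitude

euclidean = np.linalg.norm(a - b)
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(euclidean)  # ~7.48, looks "far apart"
print(cos)        # 1.0, identical direction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;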




&lt;h2&gt;
  
  
  Step 5 — The RAG Query Flow
&lt;/h2&gt;

&lt;p&gt;When a user asks a question, here's what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# chat_module.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai_module&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_embedding&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_module&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;search&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Build a context-rich search query
&lt;/span&gt;    &lt;span class="n"&gt;search_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;สินค้า&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;category_name&lt;/span&gt;  &lt;span class="c1"&gt;# "Product [category]"
&lt;/span&gt;
    &lt;span class="c1"&gt;# 2. Embed the search query (tokenization happens upstream before this call)
&lt;/span&gt;    &lt;span class="n"&gt;question_embed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Search Qdrant for the top 7 most similar sentences
&lt;/span&gt;    &lt;span class="n"&gt;gpt_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatgpt_vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question_embed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="n"&gt;search_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatgpt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gpt_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Assemble context from the matched payloads
&lt;/span&gt;    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The assembled context is then injected into GPT-4o-mini's system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;system_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Use the attached context to answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s questions.
Answer only questions related to our company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s products and services:

&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

ภาษาที่ใช้ตอบกลับ User ให้ยึดจากภาษาของคำถามล่าสุดของ User เท่านั้น&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last Thai instruction tells the model: &lt;em&gt;"Reply in the same language as the user's most recent message."&lt;/em&gt; This handles the bilingual nature of our users — some ask in Thai, some in English, some mix both.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6 — Question Classification Before RAG
&lt;/h2&gt;

&lt;p&gt;One non-obvious optimization: not every question needs a RAG lookup. We classify questions first with GPT-4o-mini to decide which path to take:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# chat_module.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;question_classification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;วิเคราะห์คำถามของ User ว่าเป็นคำถามประเภทไหน โดยให้ตอบเป็น JSON { &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: value }
    type 0 = ทักทาย / ไม่เกี่ยวกับสินค้าหรือบริการ
    type 1 = ถามเกี่ยวกับโปรโมชั่น / ส่วนลด / หมวดหมู่สินค้า
    type 2 = ถามเกี่ยวกับสาขา / พื้นที่จัดส่ง
    type 3 = ถามเกี่ยวกับข้อมูลสินค้าหรือบริการ  ← needs RAG
    type 4 = ถามทั่วไปเกี่ยวกับบริษัท&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only &lt;code&gt;type 3&lt;/code&gt; (specific product info questions) triggers the full RAG pipeline. Promotion and branch questions (&lt;code&gt;type&lt;/code&gt; 1 and 2) use structured data from a JSON catalog instead. Greetings (&lt;code&gt;type 0&lt;/code&gt;) go straight to the LLM without any retrieval at all.&lt;/p&gt;

&lt;p&gt;This classification step saves both latency and API cost — you're not doing a vector search for "สวัสดีครับ" (hello).&lt;/p&gt;
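&lt;p&gt;Wiring the router and the RAG path together looks roughly like this (a sketch: &lt;code&gt;user_message&lt;/code&gt; and &lt;code&gt;category_name&lt;/code&gt; come from the surrounding handler, and &lt;code&gt;lookup_catalog&lt;/code&gt; is a hypothetical helper for the JSON catalog):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# dispatch sketch; lookup_catalog is hypothetical, not from the real codebase
result = question_classification(user_message)
q_type = result.get("type", 0)

if q_type == 3:            # product/service details: full RAG pipeline
    context = rag(user_message, category_name)
elif q_type in (1, 2):     # promotions / branches: structured JSON catalog
    context = lookup_catalog(q_type, user_message)
else:                      # greetings and general chat: no retrieval
    context = ""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;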




&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Tokenize before embedding, always.&lt;/strong&gt; The single biggest quality improvement came from running &lt;code&gt;pythainlp&lt;/code&gt; on every piece of text before it touches the embedding model — both at ingest time and at query time. Without this, retrieval quality was noticeably worse for Thai-only queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Use PyMuPDF, not PyPDF2.&lt;/strong&gt; For Thai PDF documents, &lt;code&gt;PyMuPDF&lt;/code&gt; is dramatically more reliable. &lt;code&gt;PyPDF2&lt;/code&gt; would silently drop or garble Thai characters from complex layouts. Also note: as of v1.24+, use &lt;code&gt;import pymupdf&lt;/code&gt; instead of the legacy &lt;code&gt;import fitz&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Store original text, embed tokenized text.&lt;/strong&gt; Users should see natural language in responses. Keep these as separate fields.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Sentence-level chunks beat character-level chunks for Thai.&lt;/strong&gt; Because Thai sentences naturally carry complete thoughts, splitting at sentence boundaries (&lt;code&gt;.&lt;/code&gt;) gives the model coherent context units rather than arbitrary fragments. A &lt;code&gt;chunk_size=500&lt;/code&gt; cut might land in the middle of a Thai word — or more precisely, in the middle of a run of characters that spans multiple words, since there's no space to safely break at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Question classification as a router saves money.&lt;/strong&gt; Not every user message needs vector search. A cheap classification step routes simple questions to a direct LLM call and complex ones to the full RAG pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack at a Glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PDF extraction&lt;/td&gt;
&lt;td&gt;PyMuPDF (&lt;code&gt;pymupdf&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;1.27.2.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thai tokenization&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pythainlp&lt;/code&gt; (&lt;code&gt;newmm&lt;/code&gt; engine)&lt;/td&gt;
&lt;td&gt;5.2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding model&lt;/td&gt;
&lt;td&gt;OpenAI &lt;code&gt;text-embedding-3-small&lt;/code&gt; (1536d)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector database&lt;/td&gt;
&lt;td&gt;Qdrant + &lt;code&gt;qdrant-client&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;1.17.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;OpenAI GPT-4o-mini&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI SDK&lt;/td&gt;
&lt;td&gt;&lt;code&gt;openai&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.32.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;Python / FastAPI or Flask&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chat history&lt;/td&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building RAG for Thai taught me that most of the "standard" chunking advice assumes English. Once you work with a language that has no word boundaries, the whole pipeline has to be rethought — from how you split sentences to how you normalize text before embedding.&lt;/p&gt;

&lt;p&gt;The good news: the fix is not complicated. A single tokenization step with &lt;code&gt;pythainlp&lt;/code&gt; before embedding makes a significant difference. The hard part is knowing you need it in the first place.&lt;/p&gt;

&lt;p&gt;If you're building RAG for other Asian languages — Japanese, Chinese, Korean — the same principle applies. Never assume your text has whitespace-delimited tokens. Always pre-process with a language-appropriate tokenizer before hitting your embedding model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Addendum — "But where's the benchmark?" &lt;em&gt;(added after publication)&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;After this post went up, a sharp comment pushed back on the central claim — and they were right to. The question is worth surfacing in the article itself, because the answer matters for anyone considering this technique:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Embedding models like &lt;code&gt;text-embedding-3-small&lt;/code&gt; already have an internal BPE tokenizer. Pre-tokenizing with &lt;code&gt;newmm&lt;/code&gt; and joining with &lt;code&gt;|&lt;/code&gt; adds &lt;code&gt;|&lt;/code&gt; tokens to the sequence — possibly OOD relative to the model's training distribution. So what's the actual mechanism that makes pre-tokenization help?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two honest acknowledgments first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. This article does not contain a rigorous ablation.&lt;/strong&gt; The quality claims — including the &lt;code&gt;ส่งฟรี&lt;/code&gt; ↔ &lt;code&gt;จัดส่งฟรีไหม&lt;/code&gt; cosine-similarity example — come from production retrieval logs and qualitative observation, not from a controlled experiment with &lt;code&gt;recall@k&lt;/code&gt; or &lt;code&gt;MRR&lt;/code&gt; numbers. Treat the recommendation as &lt;em&gt;"this worked for our case"&lt;/em&gt; rather than &lt;em&gt;"this is empirically proven across Thai RAG generally."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The OOD concern is half-right.&lt;/strong&gt; &lt;code&gt;|&lt;/code&gt; (U+007C) is &lt;strong&gt;not&lt;/strong&gt; OOD at the vocabulary level — it's a single token in &lt;code&gt;cl100k_base&lt;/code&gt;, common in code, CSV, and markdown across the training data. But its &lt;strong&gt;co-occurrence pattern&lt;/strong&gt; with Thai script &lt;em&gt;is&lt;/em&gt; distributionally OOD: the model has rarely seen &lt;code&gt;"ไทย|สมุนไพร|พื้นบ้าน"&lt;/code&gt; during training, which can shift the resulting embedding toward a "structured/code-like" subspace. Whether that helps or hurts retrieval is an empirical question, not a settled one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Proposed mechanism (hypothesis, not proof)
&lt;/h3&gt;

&lt;p&gt;The reason pre-tokenization seems to help isn't the &lt;code&gt;|&lt;/code&gt; token itself — it's that &lt;strong&gt;a hard separator blocks BPE from merging across word boundaries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Thai is &lt;em&gt;scriptio continua&lt;/em&gt; — no whitespace between words. When BPE runs on raw Thai, it greedily merges subword units based on training-corpus frequency, which routinely produces tokens that &lt;strong&gt;span word boundaries&lt;/strong&gt;: the trailing subword of one word fuses with the leading subword of the next. The resulting token's embedding "blurs" the semantics of two distinct words into a single fragment.&lt;/p&gt;

&lt;p&gt;Insert any separator (&lt;code&gt;|&lt;/code&gt;, space, special token) and BPE merging halts at the separator. The tokens that come out align more closely with linguistic word boundaries, so the embedding reflects per-word semantics rather than a subword soup.&lt;/p&gt;
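
&lt;p&gt;You can inspect this directly with &lt;code&gt;tiktoken&lt;/code&gt; (a quick check, not a benchmark; the exact token IDs depend on the vocabulary, the point is only that no merge can cross the separator):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

raw = "ไทยสมุนไพรพื้นบ้าน"        # scriptio continua, no boundaries
piped = "ไทย|สมุนไพร|พื้นบ้าน"    # newmm-style word boundaries

# BPE merges cannot cross the "|", so fragments in the second
# encoding stop at word boundaries instead of spanning them.
print(len(enc.encode(raw)), enc.encode(raw))
print(len(enc.encode(piped)), enc.encode(piped))
&lt;/code&gt;&lt;/pre&gt;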

&lt;p&gt;If that hypothesis is correct, the predictions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pipe&lt;/code&gt; ≈ &lt;code&gt;space&lt;/code&gt; (both block cross-boundary merging)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pipe&lt;/code&gt; &amp;gt; &lt;code&gt;raw Thai&lt;/code&gt; (consistent with what this article reports)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pipe&lt;/code&gt; may be marginally worse than &lt;code&gt;space&lt;/code&gt; if &lt;code&gt;|&lt;/code&gt; injects distributional noise&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The ablation that should exist
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Input fed to the embedder&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Raw Thai, no pre-tokenization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;newmm&lt;/code&gt; tokens, &lt;strong&gt;space-joined&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;newmm&lt;/code&gt; tokens, &lt;strong&gt;&lt;code&gt;|&lt;/code&gt;-joined&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;{% raw %}&lt;code&gt;newmm&lt;/code&gt; tokens, special separator like &lt;code&gt;&amp;lt;sep&amp;gt;&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
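
&lt;p&gt;Producing the four variants is a few lines (a sketch; how variant D's separator is handled depends on the embedder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pythainlp.tokenize import word_tokenize

def make_variants(text):
    tokens = word_tokenize(text, engine="newmm")
    return {
        "A": text,                        # raw Thai
        "B": " ".join(tokens),            # space-joined
        "C": "|".join(tokens),            # pipe-joined
        "D": "&amp;lt;sep&amp;gt;".join(tokens),  # special separator
    }
&lt;/code&gt;&lt;/pre&gt;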

&lt;p&gt;Evaluate &lt;code&gt;recall@5&lt;/code&gt; and &lt;code&gt;MRR&lt;/code&gt; on a Thai QA dataset (TyDi-QA Thai works as an open option) or in-house labeled pairs. My prediction: A trails B/C/D clearly, while B and C land within noise of each other. If that pattern holds, the takeaway updates from &lt;em&gt;"pipe-tokenize before embedding"&lt;/em&gt; to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Insert any boundary marker that prevents BPE from merging across word boundaries. The marker itself doesn't matter much — the boundary signal does."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I plan to run this ablation and publish a follow-up with the actual numbers. If anyone runs it first, please share — happy to link it from here.&lt;/p&gt;
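
&lt;p&gt;For anyone who wants to run it first, the two metrics are a few lines each (a sketch; &lt;code&gt;ranked&lt;/code&gt; is the list of document IDs your retriever returns per query, &lt;code&gt;gold&lt;/code&gt; the labeled relevant ID):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def recall_at_k(gold_ids, ranked_per_query, k=5):
    """Fraction of queries whose gold document appears in the top k."""
    hits = sum(
        1 for gold, ranked in zip(gold_ids, ranked_per_query)
        if gold in ranked[:k]
    )
    return hits / len(gold_ids)

def mrr(gold_ids, ranked_per_query):
    """Mean reciprocal rank of the gold document (0 when absent)."""
    total = 0.0
    for gold, ranked in zip(gold_ids, ranked_per_query):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)
&lt;/code&gt;&lt;/pre&gt;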

&lt;p&gt;&lt;em&gt;Thanks to the commenter who raised this. That's the kind of question that turns a heuristic into actual knowledge.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>python</category>
      <category>nlp</category>
      <category>ai</category>
    </item>
    <item>
      <title>25 Free Developer Tools That Run 100% in Your Browser</title>
      <dc:creator>Phasu  Yeneng</dc:creator>
      <pubDate>Sat, 18 Apr 2026 07:07:19 +0000</pubDate>
      <link>https://dev.to/kmusicman/25-free-developer-tools-that-run-100-in-your-browser-90</link>
      <guid>https://dev.to/kmusicman/25-free-developer-tools-that-run-100-in-your-browser-90</guid>
      <description>&lt;p&gt;If you've ever pasted sensitive data into a random online tool and immediately regretted it — this post is for you.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;&lt;a href="https://toolsstack.cloud" rel="noopener noreferrer"&gt;toolsstack.cloud&lt;/a&gt;&lt;/strong&gt; — a collection of 25 free developer tools that run entirely in your browser. No backend. No account. No data ever leaves your device.&lt;/p&gt;

&lt;p&gt;Here's the full list, grouped by category.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔐 Security &amp;amp; Auth
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;a href="https://toolsstack.cloud/tools/jwt-decoder/" rel="noopener noreferrer"&gt;JWT Decoder&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Decode and inspect JWT tokens instantly — header, payload, expiry, and signature status. Useful when debugging auth issues without needing Postman or curl.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;a href="https://toolsstack.cloud/tools/hash-generator/" rel="noopener noreferrer"&gt;Hash Generator&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Generate MD5, SHA-1, SHA-256, and SHA-512 hashes from any text. Client-side using the Web Crypto API — nothing is sent to a server.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;a href="https://toolsstack.cloud/tools/password-generator/" rel="noopener noreferrer"&gt;Password Generator&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Cryptographically secure passwords with options for length, uppercase, numbers, symbols. Uses &lt;code&gt;crypto.getRandomValues()&lt;/code&gt; — genuinely random, not &lt;code&gt;Math.random()&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  📦 Data &amp;amp; Encoding
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4. &lt;a href="https://toolsstack.cloud/tools/json-formatter/" rel="noopener noreferrer"&gt;JSON Formatter &amp;amp; Validator&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Format, minify, and validate JSON with syntax highlighting. Shows line numbers on errors. One of the most-used tools on the site.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;a href="https://toolsstack.cloud/tools/yaml-to-json/" rel="noopener noreferrer"&gt;YAML to JSON Converter&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Paste YAML, get JSON. Or paste JSON, get YAML. Uses &lt;code&gt;js-yaml&lt;/code&gt; — supports anchors, multiline strings, and nested objects. Great for switching between Kubernetes manifests and API calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. &lt;a href="https://toolsstack.cloud/tools/json-to-csv/" rel="noopener noreferrer"&gt;JSON to CSV Converter&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Convert JSON arrays to CSV with auto-detected column headers. Download the result directly. No server upload needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. &lt;a href="https://toolsstack.cloud/tools/base64-encoder/" rel="noopener noreferrer"&gt;Base64 Encoder / Decoder&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Encode and decode Base64 — supports both text and file input. Handles UTF-8 correctly (unlike some tools that break on non-ASCII characters).&lt;/p&gt;

&lt;h3&gt;
  
  
  8. &lt;a href="https://toolsstack.cloud/tools/image-to-base64/" rel="noopener noreferrer"&gt;Image to Base64&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Upload an image, get a Base64 data URI. Copy it directly into your HTML &lt;code&gt;&amp;lt;img src=""&amp;gt;&lt;/code&gt; or CSS &lt;code&gt;background-image&lt;/code&gt;. Useful for embedding small icons without an extra HTTP request.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. &lt;a href="https://toolsstack.cloud/tools/url-encoder/" rel="noopener noreferrer"&gt;URL Encoder &amp;amp; Decoder&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Encode/decode URLs with two modes: &lt;code&gt;encodeURIComponent&lt;/code&gt; (for query params) and &lt;code&gt;encodeURI&lt;/code&gt; (for full URLs). Also parses URLs into protocol, host, path, and individual query parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. &lt;a href="https://toolsstack.cloud/tools/html-entity-encoder/" rel="noopener noreferrer"&gt;HTML Entity Encoder&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Encode special characters like &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;amp;&lt;/code&gt;, &lt;code&gt;"&lt;/code&gt; to HTML entities and back. Useful when embedding user-generated content or debugging XSS-safe output.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Developer Utilities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  11. &lt;a href="https://toolsstack.cloud/tools/uuid-generator/" rel="noopener noreferrer"&gt;UUID / GUID Generator&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Generate UUID v4 identifiers — single or bulk up to 1000 at once. Options for uppercase, no hyphens, or &lt;code&gt;{braces}&lt;/code&gt; GUID format. Uses &lt;code&gt;crypto.getRandomValues()&lt;/code&gt; for proper randomness.&lt;/p&gt;

&lt;h3&gt;
  
  
  12. &lt;a href="https://toolsstack.cloud/tools/regex-tester/" rel="noopener noreferrer"&gt;Regex Tester&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Test regular expressions with live match highlighting, group capture display, and flags support. Faster than switching between your editor and a browser console.&lt;/p&gt;

&lt;h3&gt;
  
  
  13. &lt;a href="https://toolsstack.cloud/tools/cron-generator/" rel="noopener noreferrer"&gt;Cron Expression Generator&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Build cron schedules with a visual builder — minute, hour, day, month, weekday. Shows a human-readable description and the next 5 run times. No more googling "cron every 15 minutes".&lt;/p&gt;

&lt;h3&gt;
  
  
  14. &lt;a href="https://toolsstack.cloud/tools/epoch-converter/" rel="noopener noreferrer"&gt;Epoch / Unix Timestamp Converter&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Convert between Unix timestamps and human-readable dates in any timezone. Also shows the current timestamp live — useful for debugging API responses with &lt;code&gt;created_at&lt;/code&gt; fields.&lt;/p&gt;

&lt;h3&gt;
  
  
  15. &lt;a href="https://toolsstack.cloud/tools/diff-checker/" rel="noopener noreferrer"&gt;Diff Checker&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Paste two blocks of text and see line-by-line differences highlighted in green/red. Good for quickly spotting config file changes or API response differences.&lt;/p&gt;

&lt;h3&gt;
  
  
  16. &lt;a href="https://toolsstack.cloud/tools/chmod-calculator/" rel="noopener noreferrer"&gt;chmod Calculator&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Click checkboxes for owner/group/others read/write/execute permissions and see the numeric value update in real time. Never google "chmod 755 meaning" again.&lt;/p&gt;




&lt;h2&gt;
  
  
  💻 Code &amp;amp; Text
&lt;/h2&gt;

&lt;h3&gt;
  
  
  17. &lt;a href="https://toolsstack.cloud/tools/css-minifier/" rel="noopener noreferrer"&gt;CSS Minifier&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Minify CSS and see the exact bytes saved. Pure client-side — paste your stylesheet, get minified output without uploading to any server.&lt;/p&gt;

&lt;h3&gt;
  
  
  18. &lt;a href="https://toolsstack.cloud/tools/sql-formatter/" rel="noopener noreferrer"&gt;SQL Formatter&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Format and beautify SQL queries with proper indentation and keyword casing. Supports SELECT, INSERT, UPDATE, DELETE, JOIN, subqueries.&lt;/p&gt;

&lt;h3&gt;
  
  
  19. &lt;a href="https://toolsstack.cloud/tools/markdown-converter/" rel="noopener noreferrer"&gt;Markdown Converter&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Convert Markdown to HTML with a live preview. Useful for checking how your README or blog post will render before committing.&lt;/p&gt;

&lt;h3&gt;
  
  
  20. &lt;a href="https://toolsstack.cloud/tools/lorem-ipsum-generator/" rel="noopener noreferrer"&gt;Lorem Ipsum Generator&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Generate placeholder text — choose paragraphs, sentences, or word count. Starts with the classic "Lorem ipsum dolor sit amet" or generates fully random Latin-ish text.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎨 Design &amp;amp; Visual
&lt;/h2&gt;

&lt;h3&gt;
  
  
  21. &lt;a href="https://toolsstack.cloud/tools/color-converter/" rel="noopener noreferrer"&gt;Color Converter&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Convert between HEX, RGB, HSL, and HSV instantly with a color picker. Copy any format with one click. Useful when your designer gives you a hex code but your CSS needs HSL.&lt;/p&gt;

&lt;h3&gt;
  
  
  22. &lt;a href="https://toolsstack.cloud/tools/qr-code-generator/" rel="noopener noreferrer"&gt;QR Code Generator&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Generate QR codes for URLs, WiFi credentials, vCards, email, SMS, and more. Customize dot styles, eye shapes, colors, and upload a logo overlay. Download as PNG or SVG.&lt;/p&gt;




&lt;h2&gt;
  
  
  🌏 Specialized
&lt;/h2&gt;

&lt;h3&gt;
  
  
  23. &lt;a href="https://toolsstack.cloud/tools/ip-lookup/" rel="noopener noreferrer"&gt;IP Lookup&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Shows your public IPv4 and IPv6 addresses with geolocation (country, city, ISP). Useful for verifying VPN connections or debugging network issues. Fair caveat: this is the one tool that can't be fully client-side, since a page has to query an external IP service to learn your public address.&lt;/p&gt;

&lt;h3&gt;
  
  
  24. &lt;a href="https://toolsstack.cloud/tools/pdf-text-extractor/" rel="noopener noreferrer"&gt;PDF Text Extractor&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Extract text from PDF files — supports both English and Thai. Uses pdf.js entirely in the browser. The PDF never gets uploaded anywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  25. &lt;a href="https://toolsstack.cloud/tools/thai-slug/" rel="noopener noreferrer"&gt;Thai Text to URL Slug&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Convert Thai text to URL-friendly slugs using RTGS (Royal Thai General System of Transcription) romanization. For example, &lt;code&gt;สวัสดีครับ&lt;/code&gt; → &lt;code&gt;sawatdi-khrap&lt;/code&gt;. Useful for building SEO-friendly URLs for Thai-language content.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why client-side only?
&lt;/h2&gt;

&lt;p&gt;Most online developer tools send your data to a server. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your JWT tokens (with user data) go to someone else's server&lt;/li&gt;
&lt;li&gt;Your passwords get logged somewhere&lt;/li&gt;
&lt;li&gt;Your internal API responses are stored in someone's database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every tool on toolsstack.cloud processes data entirely in your browser using native browser APIs (&lt;code&gt;crypto.getRandomValues&lt;/code&gt;, &lt;code&gt;URL&lt;/code&gt;, &lt;code&gt;FileReader&lt;/code&gt;) and trusted open-source CDN libraries like &lt;code&gt;js-yaml&lt;/code&gt; and &lt;code&gt;pdf.js&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The network tab stays clean.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech stack
&lt;/h2&gt;

&lt;p&gt;Pure HTML + Vanilla JavaScript. No framework. No build step. No &lt;code&gt;node_modules&lt;/code&gt;. The entire site is static files on shared hosting.&lt;/p&gt;

&lt;p&gt;For tools that need libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;js-yaml@4.1.0&lt;/code&gt; — YAML parsing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pdf.js@4.4&lt;/code&gt; — PDF text extraction&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qrcode-generator&lt;/code&gt; — QR code rendering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else uses browser-native APIs.&lt;/p&gt;




&lt;p&gt;If you find any of these useful — or want to suggest a tool that's missing — feel free to drop a comment below.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://toolsstack.cloud" rel="noopener noreferrer"&gt;https://toolsstack.cloud&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tools</category>
      <category>javascript</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
